Incident: Twitter Hyperlinks Break Due to Human Error at Hosting Firm

Published Date: 2012-10-08

Postmortem Analysis
Timeline
1. The software failure incident happened on the evening of October 7, 2012 [15038].
System
1. t.co domain name system
2. .CO Internet's ClientHold status
3. Melbourne IT's domain policy team
4. Twitter's t.co link shortening service
5. Dyn's domain name system connectivity service
Responsible Organization
1. Melbourne IT's domain policy team inadvertently placed the t.co domain on hold while responding to a phishing complaint, which caused the incident [15038].
2. .CO Internet, which had the ability to put the t.co domain on ClientHold status, also played a role in the incident [15038].
Impacted Organization
1. Twitter users: millions of users received "non-existent domain" errors when trying to follow links while the t.co domain was offline [15038].
2. Companies using t.co links: other companies that use Twitter's t.co link shortening service, such as Zappos and Etsy, were also affected [15038].
Software Causes
1. Human error at a Melbourne-based hosting firm responding to an abuse complaint led to the outage that broke hyperlinks on Twitter [15038].
2. Melbourne IT's policy team inadvertently placed the t.co domain on hold while actioning a phishing complaint [15038].
3. An issue with the upstream parent zone, .co, the country code domain for Colombia, contributed to the failure [15038].
Non-software Causes
1. The outage was caused by a simple human error at a Melbourne, Australia-based hosting firm responding to an abuse complaint [15038].
2. The issue was initially thought to lie with Dyn, a company providing domain name system connectivity, but was later traced to the upstream parent zone, .co, the country code domain for Colombia [15038].
3. The t.co domain was put on ClientHold status by .CO Internet; ClientHold is a special status usually reserved for customers who don't pay their bills on time [15038].
Impacts
1. Hyperlinks on Twitter broke, and millions of users received "non-existent domain" errors when trying to follow links [15038].
2. The incident highlighted the fragility of complex systems: a single point of failure can disrupt services, even if only for a short time [15038].
3. Users in Asia and Australia were affected; the outage occurred during the night in North and South America and before most of Europe was awake and online [15038].
Preventions
1. Stricter verification of actions taken on critical domain names such as t.co could have prevented the incident, for example by ensuring that only authorized and well-trained personnel can make changes to such domains [15038].
2. Redundant systems or failover mechanisms for the t.co domain could have mitigated the impact, for example backup servers or alternative routing methods that handle link shortening if the primary system goes offline [15038].
3. Thorough testing and simulation of potential failure scenarios could have helped identify the central point of failure introduced by routing all outbound links through t.co, so that measures to strengthen the infrastructure could be taken in advance [15038].
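The first prevention (stricter verification for critical domains) could be sketched roughly as below. This is only an illustration under stated assumptions: the names `PROTECTED_DOMAINS`, `HoldRequest`, and `may_apply_hold` are hypothetical, and nothing here reflects how Melbourne IT or .CO Internet actually process hold requests. The idea is that a hold on a high-impact domain requires at least one approver other than the person who requested it.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

# Hypothetical list of high-impact domains; a real registrar would maintain its own.
PROTECTED_DOMAINS: FrozenSet[str] = frozenset({"t.co"})


@dataclass(frozen=True)
class HoldRequest:
    """A request to place a domain on hold (hypothetical structure)."""
    domain: str
    requested_by: str
    approved_by: Tuple[str, ...] = ()


def may_apply_hold(request: HoldRequest,
                   protected: FrozenSet[str] = PROTECTED_DOMAINS) -> bool:
    """Return True if the hold may be applied.

    Ordinary domains need no extra approval. Protected domains need at
    least one approver who is not the requester.
    """
    # Normalize case and a trailing root dot so "T.CO." matches "t.co".
    domain = request.domain.strip().rstrip(".").lower()
    if domain not in protected:
        return True
    independent = {a for a in request.approved_by if a != request.requested_by}
    return len(independent) >= 1
```

With this sketch, a hold on `t.co` requested by one operator is refused until a second person signs off, while holds on ordinary domains proceed as before.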
Fixes
1. Implement stricter verification processes to prevent human errors like the one that occurred at Melbourne IT [15038].
2. Enhance monitoring and alerting so that issues with the t.co domain are identified and rectified quickly, minimizing downtime [15038].
3. Diversify domain name system connectivity providers to reduce reliance on a single point of failure [15038].
4. Develop a backup system or alternative routing mechanism for links in case the t.co domain experiences issues in the future [15038].
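The monitoring fix could look something like the following minimal sketch. It assumes a periodic probe that tries to resolve the domain and raises an alert only after several consecutive failures. The function names and the threshold are illustrative assumptions, not anything described in the article. Note that `socket.getaddrinfo` reports resolution failures generically through `socket.gaierror`, so this sketch does not distinguish a "non-existent domain" answer from other resolver errors.

```python
import socket
from typing import Callable, Sequence

Resolver = Callable[[str], object]


def default_resolver(hostname: str) -> object:
    # Raises socket.gaierror when the name cannot be resolved.
    return socket.getaddrinfo(hostname, 443)


def check_domain(hostname: str, resolver: Resolver = default_resolver) -> str:
    """Return "ok" if the hostname resolves, otherwise "unresolvable"."""
    try:
        resolver(hostname)
    except socket.gaierror:
        return "unresolvable"
    return "ok"


def should_alert(results: Sequence[str], threshold: int = 3) -> bool:
    """Alert only when the last `threshold` probes all failed.

    Requiring consecutive failures avoids paging on a single blip.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    recent = list(results)[-threshold:]
    return len(recent) == threshold and all(r == "unresolvable" for r in recent)
```

The resolver is passed in as a parameter so the check can be exercised without network access; in production the default resolver would be used and the probe results appended to a short history on each run.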
References
1. Melbourne IT spokesperson
2. Dyn's chief scientist
3. .CO Internet spokeswoman
4. Twitter representative
5. Mikko Hypponen, chief research officer for F-Secure
6. Twitter's original announcement about t.co

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident related to Twitter's t.co domain going offline due to a human error at Melbourne IT has happened again within the same organization. The incident occurred when Melbourne IT inadvertently placed the t.co domain on hold while responding to a phishing complaint, causing t.co links to stop working. This incident highlights the potential risks associated with centralizing services like link shortening [15038]. (b) The software failure incident related to the t.co domain going offline has also happened at other organizations or with their products and services. The article mentions that Dyn, a company providing domain name system connectivity for Twitter's t.co service, was initially thought to be the cause of the issue. Additionally, the article discusses the role of .CO Internet in managing the t.co domain, indicating that similar incidents could potentially occur with other organizations using the .co domain suffix [15038].
Phase (Design/Operation) design (a) The software failure incident in the article was primarily due to a design-related issue. The outage that broke hyperlinks on Twitter was caused by a simple human error at a Melbourne-based hosting firm while responding to an abuse complaint. This error led to the t.co domain being placed on hold, resulting in millions of Twitter users receiving "non-existent domain" errors when trying to follow links. The introduction of a central point of failure with the t.co domain abbreviation system played a significant role in the incident [15038]. (b) The software failure incident was not primarily due to operation-related factors such as misuse of the system. The outage was caused by a design flaw introduced during the system development and maintenance processes, specifically the human error at the hosting firm that led to the t.co domain being placed on hold [15038].
Boundary (Internal/External) within_system (a) The software failure incident reported in the article was primarily within the system. The failure originated from a simple human error at a Melbourne-based hosting firm that was responding to an abuse complaint. The t.co domain used by Twitter for hyperlink abbreviation was mistakenly placed on hold by Melbourne IT's policy team while actioning a phishing complaint. This internal error took the t.co domain offline, causing millions of Twitter users to receive "non-existent domain" errors when trying to follow links. The issue was rectified within approximately 40 minutes, which indicates that the failure was within the system and not due to external factors [15038].
Nature (Human/Non-human) human_actions (a) The software failure incident was not caused by non-human actions. It originated from a simple human error at a Melbourne-based hosting firm that was responding to an abuse complaint; this error led to the t.co domain being placed on hold, causing the outage that broke hyperlinks on Twitter [15038]. (b) The failure is attributed to human actions: the error that caused the outage was made by Melbourne IT's policy team while actioning a phishing complaint. Inadvertently placing the t.co domain on hold disrupted service for Twitter users [15038].
Dimension (Hardware/Software) software (a) The software failure incident in the article was not attributed to hardware issues but rather to a simple human error at a Melbourne-based hosting firm that was responding to an abuse complaint [15038]. (b) The software failure incident was primarily attributed to a simple human error at a hosting firm that resulted in the t.co domain being placed on hold, causing hyperlinks on Twitter to break. This error was related to the process of actioning a phishing complaint, indicating a software-related failure originating from human actions and processes [15038].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident was non-malicious. The outage that broke hyperlinks on Twitter was caused by a simple human error at a Melbourne-based hosting firm responding to an abuse complaint. The error occurred when the t.co domain was inadvertently placed on hold while actioning a phishing complaint, leading to millions of Twitter users receiving "non-existent domain" errors when trying to follow links. The issue was realized and rectified in approximately 40 minutes [15038].
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident resulted from accidental decisions. It originated with a simple human error at a Melbourne-based hosting firm that was responding to an abuse complaint: Melbourne IT's policy team inadvertently placed the t.co domain on hold while actioning a phishing complaint. This took the t.co domain offline, causing millions of Twitter users to receive "non-existent domain" errors when trying to follow links. The error was realized and rectified in approximately 40 minutes [15038].
Capability (Incompetence/Accidental) accidental (a) The software failure incident was not due to development incompetence but rather an accidental human error at a Melbourne-based hosting firm responding to an abuse complaint. The incident occurred when Melbourne IT's policy team inadvertently placed the t.co domain on hold while actioning a phishing complaint, causing the t.co domain to go offline and resulting in millions of Twitter users receiving "non-existent domain" errors when trying to follow links [15038]. (b) The software failure incident was accidental in nature, stemming from a human error made while responding to an abuse complaint. Melbourne IT's policy team accidentally placed the t.co domain on hold during the process of actioning a phishing complaint, leading to the t.co domain going offline and causing link failures for Twitter users. The error was realized and rectified in approximately 40 minutes [15038].
Duration temporary (a) The software failure incident in this case was temporary. The outage that broke hyperlinks on Twitter lasted less than an hour [15038]. The issue was realized and rectified in approximately 40 minutes, and t.co links began working again [15038].
Behaviour crash, omission, value, other (a) crash: The software failure incident described in the article can be categorized as a crash. The outage that broke hyperlinks on Twitter resulted in the system losing its state and not performing its intended functions. Users received "non-existent domain" errors when trying to follow links due to the failure of the t.co domain, which went offline [15038]. (b) omission: The incident can also be classified as an omission. The system omitted to perform its intended functions at an instance when the t.co domain was placed on hold inadvertently while actioning a phishing complaint, leading to the hyperlinks not working for a period of time [15038]. (c) timing: The timing of the software failure incident is not the primary issue in this case. The system did not fail due to performing its intended functions too late or too early [15038]. (d) value: The failure can be attributed to the system performing its intended functions incorrectly. The error occurred when the t.co domain was mistakenly placed on hold while responding to a phishing complaint, causing the hyperlinks to break and users to encounter errors [15038]. (e) byzantine: The software failure incident does not exhibit characteristics of a byzantine failure where the system behaves erroneously with inconsistent responses and interactions. The issue primarily stemmed from a human error at a hosting firm and the subsequent actions taken in response to a phishing complaint [15038]. (f) other: The other behavior exhibited in this software failure incident is the introduction of a central point of failure that did not exist before. By routing all outbound links through the t.co domain, Twitter inadvertently created a single point of failure, which led to the outage when the domain went offline [15038].
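The central point of failure noted under Behaviour (f) can be illustrated with a small fallback sketch. It assumes, hypothetically, that the original destination is stored alongside each shortened link, so that a client can bypass the shortener when its domain is unavailable. The `Link` type, `choose_target` function, and the example URLs are illustrative assumptions and do not describe Twitter's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Link:
    short_url: str     # shortened link, e.g. on the t.co domain
    expanded_url: str  # original destination, assumed to be stored alongside


def choose_target(link: Link, shortener_is_up: Callable[[], bool]) -> str:
    """Prefer the short URL; fall back to the stored destination.

    `shortener_is_up` is a caller-supplied health check. If it returns
    False or raises, the shortener is treated as unavailable.
    """
    try:
        if shortener_is_up():
            return link.short_url
    except Exception:
        pass  # a failing health check counts as "down"
    return link.expanded_url
```

The trade-off in such a design is that bypassing the shortener also bypasses whatever the shortener provides (for example click measurement or link screening), so a fallback like this would be a degraded mode rather than a replacement.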

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence theoretical_consequence (a) unknown (b) unknown (c) unknown (d) unknown (e) unknown (f) unknown (g) no_consequence (h) theoretical_consequence (i) The article discusses the potential consequences of the software failure incident, such as the fragility of complex systems and the impact of a central point of failure introduced by routing all outbound links through t.co. The outage lasted less than an hour and did not result in any significant real-world consequences beyond inconvenience for users in certain regions. The incident highlighted the vulnerability of relying on a single point of failure in a system [15038].
Domain information (a) The failed system was related to the information industry as it involved the outage of hyperlinks on Twitter, impacting the distribution of information to millions of users [15038].

