Incident Details

Incident: eCensus Website Shutdown: DDoS Attack Impacting Australian Taxpayers

Published Date: 2016-10-25

Postmortem Analysis
Timeline	1. The software failure incident happened in August 2016. [48494, 49158]
System	1. IBM's geo-blocking protocol 2. Router configuration and rebooting process [#, #]
Responsible Organization	1. IBM Australia, specifically its managing director Kerry Purcell and managing senior engineer Michael Shallcross, were responsible for causing the software failure incident [48494, 49158].
Impacted Organization	1. Australian taxpayers [48494, 49158] 2. Australian Bureau of Statistics (ABS) [48494, 49158] 3. IBM [48494, 49158]
Software Causes	1. The failure incident was caused by distributed denial-of-service (DDoS) attacks on the eCensus website, leading to its shutdown for over 40 hours [48494, 49158]. 2. The incident was exacerbated by a failure in the geo-blocking service, which was not properly applied by an internet service provider (ISP) as expected, allowing foreign traffic to bypass the protection measures [48494, 49158]. 3. There were issues with the routers, as restarting them after the attacks revealed configuration problems that could have been resolved earlier if tested properly [49158].
Non-software Causes	1. The failure incident was caused by a distributed denial-of-service (DDoS) attack, specifically the fourth attack that struck down the website on August 9, which was foreign-sourced and not considered significant in the industry [48494, 49158]. 2. The incident was exacerbated by a failure in the geo-blocking service during the DDoS attack, as identified by the prime minister's special advisor on cyber security [48494, 49158]. 3. There was a lack of proper implementation of geo-blocking by the internet sub-contractors, which allowed foreign traffic through Singapore despite assurances that geo-blocking was in place [48494]. 4. The incident was also attributed to a configuration problem with the routers, which could have been resolved earlier by power cycling the routers [49158].
Impacts	1. The software failure incident led to the eCensus website being offline for over 40 hours, causing inconvenience to thousands of Australians trying to input their data and costing taxpayers up to $30 million [48494, 49158]. 2. The incident resulted in the ABS having to keep the website offline for an additional 40 hours despite IBM being prepared to relaunch it sooner [48494]. 3. Participants trying to access the eCensus website encountered error messages, leading to frustration and complaints on social media [48494, 49158]. 4. The failure to properly implement geo-blocking and address DDoS attacks caused the website to become unresponsive during the incident [48494, 49158]. 5. The incident highlighted a lack of coordination and communication between IBM, its sub-contractors, and the ABS in handling the DDoS attacks and ensuring the website's availability [48494, 49158].
Preventions	1. Implementing proper geo-blocking protocols to prevent DDoS attacks [48494, 49158] 2. Conducting more thorough testing on routers and ensuring they are properly configured [49158] 3. Ensuring all sub-contractors are fully aware of and capable of implementing necessary security measures [48494] 4. Having a backup router in place to handle overwhelming traffic [48494] 5. Reacting promptly to incidents by restarting routers to resolve configuration problems [49158]
Fixes	1. Turning the router's power 'off and on again' could have solved the problem earlier [49158]. 2. Implementing geo-blocking properly and ensuring it is in place to prevent DDoS attacks [48494]. 3. Conducting more testing on the routers to ensure they can implement geo-blocking directions effectively [48494].
References	1. IBM managing director Kerry Purcell 2. IBM Australia managing senior engineer Michael Shallcross 3. Australian Bureau of Statistics (ABS) 4. Treasury Secretary John Fraser 5. Prime minister's special advisor on cyber security Alastair MacGibbon 6. Vocus 7. NextGen 8. Daily Telegraph 9. Senate estimates hearing 10. Census Australia participants 11. Social media users 12. Revolution IT Pty Ltd 13. EFTM 14. Senate committee 15. Australian Signals Directorate [48494, 49158]
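
The geo-blocking failure described above can be illustrated with a minimal sketch. It assumes a simple allowlist of address ranges and a per-link flag saying whether an upstream provider enforces the filter. The ranges, the link names (`isp_a`, `isp_b`) and all function names are hypothetical. They are not taken from the articles and do not describe IBM's actual 'Island Australia' implementation.

```python
import ipaddress

# Hypothetical allowlist: documentation-only ranges standing in for
# "domestic" address space. A real deployment would use a maintained
# GeoIP feed, which this sketch does not model.
DOMESTIC_RANGES = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_domestic(source_ip: str) -> bool:
    """Return True if source_ip falls inside any allowlisted range."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in net for net in DOMESTIC_RANGES)


def admitted(source_ip: str, link_enforces_geoblock: bool) -> bool:
    """Decide whether one upstream link lets a packet through.

    A link that does not enforce the filter admits everything, which is
    the failure mode the report attributes to the geo-blocking service.
    """
    if not link_enforces_geoblock:
        return True
    return is_domestic(source_ip)


def leaky_links(links: dict[str, bool]) -> list[str]:
    """Name every upstream link that would pass a foreign probe address."""
    foreign_probe = "192.0.2.10"  # documentation address, treated as foreign
    return sorted(
        name for name, enforces in links.items()
        if admitted(foreign_probe, enforces)
    )
```

For example, `leaky_links({"isp_a": True, "isp_b": False})` returns `["isp_b"]`. A pre-launch check of this kind is one way to act on the prevention item about confirming that every sub-contractor has actually applied the filter: the protection only holds if no link is reported as leaky.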

Software Taxonomy of Faults

Category	Option	Rationale
Recurring	one_organization	(a) The software failure incident happened again at one_organization: The 2016 eCensus website shutdown in Australia was attributed to a distributed denial-of-service (DDoS) attack and issues with geo-blocking protocols. IBM, the company responsible for developing and running the eCensus website, faced repeated DDoS attacks on the same site, and the fourth attack took the website down, after which it stayed offline for over 40 hours. Despite IBM's claims of anticipating and planning for DDoS attacks using geo-blocking, the incident still occurred, leading to significant inconvenience and financial losses for the Australian government and taxpayers [48494, 49158]. (b) The software failure incident happened again at multiple_organization: The provided articles do not mention the software failure incident happening again at other organizations or with their products and services.
Phase (Design/Operation)	design, operation	(a) The design-phase contribution to the software failure incident was that geo-blocking protocols were not properly applied by the internet service provider (ISP) as part of the system development. IBM, the contractor responsible for the eCensus website, said the incident was caused by a geo-blocking protocol not being applied by the ISP, which left the website exposed to distributed denial-of-service (DDoS) attacks [48494]. (b) The operation-phase contribution was highlighted by IBM's managing senior engineer Michael Shallcross, who suggested that a simple solution of turning the router's power 'off and on again' could have solved the problem earlier. This indicates that operational issues, such as router configuration problems, contributed to the system failure during operation [49158].
Boundary (Internal/External)	within_system, outside_system	(a) within_system: The software failure incident related to the eCensus website shutdown in Australia was primarily due to factors originating from within the system. IBM, the company responsible for developing and running the eCensus website, faced issues such as distributed denial-of-service (DDoS) attacks, misconfiguration of routers, failure in geo-blocking protocols, and inadequate testing of the system's resilience to attacks [48494, 49158]. These internal factors contributed to the website going offline for over 40 hours, causing inconvenience to the Australian public and the government. (b) outside_system: On the other hand, external factors also played a role in the software failure incident. The incident involved DDoS attacks that originated externally, causing disruptions to the eCensus website [48494, 49158]. Additionally, there were issues with the implementation of geo-blocking by internet sub-contractors, which led to foreign traffic bypassing the intended restrictions and affecting the website's performance [48494].
Nature (Human/Non-human)	non-human_actions, human_actions	(a) The software failure incident occurring due to non-human actions: - The software failure incident was attributed to a distributed denial-of-service (DDoS) attack on the eCensus website, causing it to go offline for over 40 hours [48494]. - IBM had implemented protection measures such as geo-blocking, known as 'Island Australia,' to defend against DDoS attacks [48494]. - The incident involved a failure in the geo-blocking service during the fourth DDoS attack, leading to the website becoming unresponsive [49158]. - The DDoS attack traffic peaked at 563Mbps and lasted 14 minutes, which was not considered significant in the industry [49158]. (b) The software failure incident occurring due to human actions: - IBM managing director Kerry Purcell took full responsibility for the Census website meltdown but insisted the website was not hacked [48494]. - There was a blame-game between the contractor and its sub-contractors over the DDoS attacks, with issues related to the geo-blocking protocol not being applied by an internet service provider [48494]. - IBM engineer Michael Shallcross mentioned that greater certainty from sub-contractors regarding the implementation of geo-blocking directions and more testing on the routers could have been beneficial [48494]. - The ABS stated that the risk of DDoS attacks was not adequately addressed by IBM, indicating a failure in managing risk related to the incident [49158].
Dimension (Hardware/Software)	hardware, software	(a) The software failure incident occurring due to hardware: - In Article 49158, IBM Australia's managing senior engineer Michael Shallcross mentioned that a simple solution to the problem could have been turning the router's power 'off and on again', indicating a hardware-related issue [49158]. (b) The software failure incident occurring due to software: - The software failure incident in both articles primarily stemmed from software-related issues such as the failure to properly implement geo-blocking protocols, configuration problems, and the misidentification of normal traffic patterns as data exfiltration [48494, 49158].
Objective (Malicious/Non-malicious)	non-malicious	(a) The software failure incident was non-malicious. The incident was attributed to a distributed denial-of-service (DDoS) attack on the eCensus website, causing it to go offline for over 40 hours. IBM, the contractor responsible for developing and running the eCensus, stated that the failure was not due to a hack and that no personal information of participants had been compromised [48494, 49158]. The incident involved issues with geo-blocking protocols not being properly applied by internet service providers, leading to the website being overwhelmed by foreign traffic. IBM had anticipated and planned for DDoS attacks using geo-blocking protection, known as 'Island Australia,' but there were failures in its implementation. The incident also highlighted miscommunication and lack of coordination between IBM and its sub-contractors regarding security measures [48494, 49158].
Intent (Poor/Accidental Decisions)	poor_decisions, accidental_decisions	(a) The software failure incident was related to poor_decisions. IBM's handling of the eCensus website shutdown was attributed to poor decisions, such as not properly implementing geo-blocking protocols and failing to address risks adequately. The incident involved a distributed denial-of-service (DDoS) attack that overwhelmed the website, leading to its shutdown for over 40 hours. Despite IBM's claims of anticipating and planning for DDoS attacks, the failure to effectively implement geo-blocking and address configuration issues ultimately led to the costly shutdown [48494, 49158]. (b) The software failure incident also involved accidental_decisions. For example, IBM's senior engineer mentioned that a simple solution like power cycling the router could have potentially resolved the issue earlier, indicating that the failure could have been due to an unintended mistake in not trying this approach sooner. Additionally, there were instances where sub-contractors failed to properly implement geo-blocking despite being instructed to do so, leading to continued vulnerabilities in the system [49158].
Capability (Incompetence/Accidental)	development_incompetence, accidental	(a) The software failure incident occurred due to development incompetence. IBM, the company responsible for the Census website, faced criticism for the failure, with the managing director taking full responsibility for the meltdown. The incident was attributed to a distributed denial-of-service (DDoS) attack that overwhelmed the website, leading to it being offline for over 40 hours. Despite IBM's claims of anticipating and planning for DDoS attacks using geo-blocking, the incident highlighted failures in implementing proper security measures and protocols [48494, 49158]. (b) The software failure incident also had accidental contributing factors. For example, IBM's senior engineer mentioned that a simple solution like power cycling the router could have potentially solved the problem earlier, indicating that the issue might have been overlooked or not tested thoroughly during development. Additionally, there were instances where normal traffic patterns were falsely identified as data exfiltration, leading to misinterpretations and disruptions in the system [49158].
Duration	temporary	(a) The software failure incident was temporary. The eCensus website shutdown lasted over 40 hours as thousands tried to input their data [48494]. IBM was prepared to relaunch the website three hours after it failed, but the Australian Bureau of Statistics insisted it be kept offline for a further 40 hours [48494]. The incident was attributed to distributed denial-of-service (DDoS) attacks and issues with geo-blocking protocols not being applied by an internet service provider [48494]. IBM claimed to have successfully defended against further DDoS attacks on the site [48494]. (b) The software failure incident was temporary due to contributing factors introduced by certain circumstances but not all. IBM Australia managing senior engineer Michael Shallcross mentioned that turning the router's power 'off and on again' could have solved the problem earlier, indicating a specific technical issue that could have been addressed [49158].
Behaviour	crash, omission, other	(a) crash: The software failure incident in the articles can be categorized as a crash. The eCensus website experienced a meltdown on August 9, leading to it going offline for over 40 hours, preventing users from inputting their data [48494, 49158]. (b) omission: The incident can also be classified as an omission failure. Despite IBM's preparations for DDoS attacks using geo-blocking, the system omitted to effectively block the attack traffic, leading to the website becoming unresponsive [48494, 49158]. (c) timing: There is no specific indication in the articles that the software failure incident was related to timing issues. (d) value: The incident does not align with a value failure where the system performs its intended functions incorrectly. (e) byzantine: The incident does not exhibit characteristics of a byzantine failure where the system behaves erroneously with inconsistent responses and interactions. (f) other: The other behavior observed in the incident is the failure due to the system falsely identifying normal traffic patterns as data exfiltration, leading to misinterpretation and subsequent issues [49158].
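
To put the reported attack size in perspective, the following back-of-the-envelope sketch computes an upper bound on the traffic volume. It assumes the reported peak of 563Mbps was sustained for the full 14 minutes, which the articles do not state, and it uses decimal units (1 GB = 10^9 bytes). The function name is illustrative.

```python
PEAK_MBPS = 563      # peak rate reported for the fourth attack
DURATION_MIN = 14    # reported duration of that attack


def upper_bound_gigabytes(peak_mbps: float, minutes: float) -> float:
    """Upper bound on volume, assuming the peak rate held the whole time."""
    bits = peak_mbps * 1_000_000 * minutes * 60  # megabits/s -> total bits
    return bits / 8 / 1_000_000_000              # bits -> bytes -> GB


volume_gb = upper_bound_gigabytes(PEAK_MBPS, DURATION_MIN)
print(f"at most about {volume_gb:.1f} GB")
```

Under these assumptions the bound is roughly 59 GB in total, which is consistent with the description of the attack as not significant by industry standards.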

IoT System Layer

Layer	Option	Rationale
Perception	None	None
Communication	None	None
Application	None	None

Other Details

Category	Option	Rationale
Consequence	property, delay, non-human, theoretical_consequence	(a) death: People lost their lives due to the software failure - No information about any deaths caused by the software failure incident was mentioned in the articles [48494, 49158]. (b) harm: People were physically harmed due to the software failure - No information about physical harm to individuals due to the software failure incident was provided in the articles [48494, 49158]. (c) basic: People's access to food or shelter was impacted because of the software failure - No information about people's access to food or shelter being impacted by the software failure incident was discussed in the articles [48494, 49158]. (d) property: People's material goods, money, or data was impacted due to the software failure - The software failure incident resulted in significant financial consequences, costing Australian taxpayers up to $30 million [48494, 49158]. (e) delay: People had to postpone an activity due to the software failure - Thousands of Australians were unable to input their data into the Census website due to the software failure, leading to inconvenience and frustration [48494, 49158]. (f) non-human: Non-human entities were impacted due to the software failure - The software failure incident affected the functioning of the eCensus website, leading to it being offline for over 40 hours and displaying error messages [48494, 49158]. (g) no_consequence: There were no real observed consequences of the software failure - The software failure incident had real observed consequences, including financial losses, inconvenience to users, and the website being offline for an extended period [48494, 49158]. (h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur - The potential consequences discussed included the risk of personal information exposure due to the failure, but it was assured that no personal information of participants had been compromised [48494]. (i) other: Was there consequence(s) of the software failure not described in the (a to h) options? What is the other consequence(s)? - No other specific consequences of the software failure incident were mentioned in the articles [48494, 49158].
Domain	information, government	(a) The failed system was intended to support the production and distribution of information. The software failure incident was related to the 2016 eCensus website shutdown in Australia, which was a platform for millions of residents to fill out the compulsory survey on August 9. The Australian Bureau of Statistics paid nearly $10 million to develop the official census website, and the system was designed to handle one million form submissions per hour [48494, 49158].
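
The design figure of one million form submissions per hour is easier to reason about as a per-second rate. The sketch below does that conversion. It gives an average only and says nothing about peak bursts, which the articles do not quantify. The function name is illustrative.

```python
DESIGN_SUBMISSIONS_PER_HOUR = 1_000_000  # design figure quoted above


def per_second(per_hour: int) -> float:
    """Convert an hourly rate to an average per-second rate."""
    return per_hour / 3600


rate = per_second(DESIGN_SUBMISSIONS_PER_HOUR)
print(f"{rate:.1f} submissions per second on average")
```

This works out to roughly 278 submissions per second on average.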

Sources

IBM boss apologises for Census fail costing taxpayers up to $30mm - Daily Mail - Published on: 2016-10-25
Article ID: 48494
IBM boss apologises for Census fail costing taxpayers up to $30mm - Daily Mail - Published on: 2016-10-24
Article ID: 49158
