Incident: McAfee Antivirus Update Causes Widespread Windows XP Crashes

Published Date: 2010-04-23

Postmortem Analysis
Timeline 1. The software failure incident involving McAfee's buggy antivirus update that caused computers to crash or keep rebooting happened on April 21, 2010 [Article 1406]. 2. The incident occurred on April 21, 2010.
System 1. McAfee's antivirus software update 2. Windows XP with Service Pack 3 3. SVCHOST.EXE file
Responsible Organization 1. McAfee [1598, 1406]
Impacted Organization 1. University of Michigan's medical school [1598, 1406] 2. Kentucky police [1598, 1406] 3. Intel [1598, 1406] 4. Rhode Island hospitals [1598, 1406] 5. Australian supermarket chain [1598]
Software Causes 1. Poor testing due to changes in McAfee's quality assurance process allowed the buggy DAT file to get past the test environment and onto customers' PCs [1598]. 2. The faulty update misidentified a key Windows file called svchost.exe as a virus, causing PCs to crash or keep rebooting [1598, 1406]. 3. The update redirected the PC's immune system to attack a legitimate operating system component (SVCHOST.EXE) due to a misidentification with malware [1406]. 4. The update caused McAfee's application to incorrectly confuse SVCHOST.EXE with the W32/Wecorl.a virus [1406].
Non-software Causes 1. Lack of proper testing procedures: McAfee acknowledged that the buggy antivirus update was a result of poor testing due to recent changes in their quality assurance process [1598]. 2. Inadequate communication and response: McAfee faced criticism for downplaying the impact of the issue and for not providing a swift and effective solution to the problem, leading to frustration among customers and system administrators [1406].
Impacts 1. McAfee's faulty antivirus update caused computers running Windows XP with Service Pack 3 to crash or keep rebooting, impacting customers worldwide, including chipmaker Intel, Rhode Island hospitals, Kentucky police, University of Michigan's medical school, and an Australian supermarket chain [1598]. 2. The University of Michigan's medical school reported that 8,000 of its 25,000 computers crashed, police in Lexington, Ky., resorted to hand-writing reports, some jails canceled visitation, and Rhode Island hospitals turned away non-trauma patients at emergency rooms and postponed elective surgeries [1406]. 3. Intel was also affected by the bungled update, with many of its computers in the United States running McAfee being impacted [1406]. 4. Enterprise users, especially system administrators, were forced to manually install the repair that McAfee had made available, causing significant disruptions and frustrations [1406]. 5. The software failure incident led to widespread condemnation on social media platforms like Twitter, with users expressing their dissatisfaction and frustration with McAfee's handling of the situation [1406].
Preventions 1. Implementing a more rigorous and thorough quality assurance process to catch bugs before updates are released to customers could have prevented the software failure incident [1598, 1406]. 2. Conducting extensive testing on updates that directly affect crucial Windows system files to avoid misidentifying legitimate files as viruses could have prevented the incident [1598]. 3. Enhancing the Artemis system to include a more comprehensive list of Windows system files to avoid mistakenly targeting important operating system components could have prevented the incident [1598]. 4. Ensuring that software updates are thoroughly tested and validated across a wide range of systems and configurations to prevent widespread impact on customers' computers could have prevented the incident [1406].
Fixes 1. Implementing new quality assurance steps to address updates that directly affect crucial Windows system files and beefing up the Artemis system to include a more comprehensive list of Windows system files to leave alone [1598]. 2. Providing a fix in the form of the SuperDAT Remediation Tool to stifle the updated driver causing the false positive and restore the svchost.exe file [1598]. 3. Working around the clock to help customers get their systems back online and offering support reps for further assistance [1598]. 4. Manually installing the repair that McAfee made available by midday for enterprise users who were the most affected by the update [1406]. 5. Posting detailed instructions on a separate site on how to fix XP computers that have been crashing because of the update, recommending the manual download and installation of an "EXTRA.DAT" file and restoring incorrectly quarantined files [1406].
References 1. McAfee (Barry McPherson, executive vice president of support and customer service) - [1598] 2. University of Michigan's medical school - [1598] 3. Kentucky police - [1598] 4. Intel - [1598] 5. Rhode Island hospitals - [1598] 6. CNET - [1406] 7. Tech-related mailing lists - [1406] 8. Twitter users - [1406] 9. Internet Storm Center - [1406] 10. District of Columbia's deputy chief information officer - [1406]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident happened again at McAfee. The incident described in the articles is related to a buggy antivirus update that caused computers running Windows XP with Service Pack 3 to crash or keep rebooting. McAfee acknowledged that the problem occurred due to poor testing and a faulty DAT file that misidentified a key Windows file as a virus ([1598], [1406]). (b) The software failure incident also affected other organizations besides McAfee. The incident impacted various customers worldwide, including chipmaker Intel, Rhode Island hospitals, Kentucky police, University of Michigan's medical school, and an Australian supermarket chain. These organizations experienced issues such as computer crashes, lost productivity, and disruptions due to the faulty antivirus update released by McAfee ([1598], [1406]).
Phase (Design/Operation) design, operation (a) The software failure incident with McAfee's antivirus update was primarily attributed to poor testing as a contributing factor introduced during system development. McAfee acknowledged that the buggy DAT file got past the test environment due to changes in its quality assurance process [1598]. The update misidentified a key Windows file as a virus, causing PCs to crash or keep rebooting, impacting customers worldwide [1598]. McAfee also mentioned adding new QA steps to address updates affecting crucial Windows system files to prevent such incidents in the future [1598]. (b) The software failure incident also involved contributing factors introduced during the operation or misuse of the system. The faulty update released by McAfee redirected the PC's immune system to attack a legitimate operating system component, causing widespread computer crashes and reboots [1406]. System administrators were forced to manually install the repair on affected computers, indicating operational challenges faced by users [1406]. The incident led to significant disruptions for enterprise users, with complaints flooding tech-related mailing lists and social media platforms [1406].
Boundary (Internal/External) within_system, outside_system (a) within_system: The software failure incident with McAfee's antivirus update was primarily caused by poor testing within the system. McAfee acknowledged that the buggy DAT file was able to get past the test environment due to changes in their quality assurance process, leading to the misidentification of a key Windows file as a virus and causing PCs to crash or keep rebooting [1598]. (b) outside_system: The software failure incident also had contributing factors originating from outside the system. For example, the update released by McAfee caused widespread damage affecting various organizations and individuals outside of McAfee's immediate control, such as the University of Michigan's medical school, Kentucky police, Intel, Rhode Island hospitals, and an Australian supermarket chain [1598]. Additionally, the incident led to significant disruptions for enterprise users, system administrators, and individuals who were forced to manually install the repair provided by McAfee [1406].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurred due to non-human actions, specifically poor testing processes at McAfee. McAfee acknowledged that the buggy antivirus update was a result of a faulty DAT file that misidentified a key Windows file as a virus, causing PCs to crash or keep rebooting. This issue arose because McAfee recently changed its quality assurance process, allowing the buggy DAT file to pass through the testing environment and onto customers' PCs [1598]. (b) The software failure incident also involved human actions. McAfee's executive vice president of support and customer service, Barry McPherson, issued an apology on behalf of the company for the chaos caused by the faulty antivirus update. McAfee admitted that the problem was a result of poor testing, indicating a failure in the human-driven quality assurance process that led to the release of the buggy update [1598]. Additionally, McAfee apologized to customers for the problem caused by the update that turned the software's defenses against a vital component of Microsoft Windows, impacting tens of thousands of computers [1406].
Dimension (Hardware/Software) software (a) The software failure incident reported in the articles was primarily due to contributing factors originating in software. McAfee pushed out a buggy antivirus update that misidentified a key Windows file as a virus, causing computers to crash or keep rebooting [1598]. The update redirected the PC's immune system to attack a legitimate operating system component, SVCHOST.EXE, due to a mistake in McAfee's application [1406]. The incident was a result of poor testing and a faulty DAT file getting past the test environment [1598]. (b) There is no specific information in the articles pointing to the software failure incident being caused by contributing factors originating in hardware.
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident described in the articles was non-malicious. McAfee's buggy antivirus update that caused computers to crash or repeatedly reboot was a result of poor testing and a faulty DAT file misidentifying a key Windows file as a virus. There is no indication in the articles that the failure was due to any malicious intent to harm the system [1598, 1406].
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions (a) The software failure incident involving McAfee's buggy antivirus update was primarily due to poor decisions made by the company. McAfee acknowledged that the problem occurred due to poor testing processes that allowed the faulty DAT file to pass through the quality assurance process and onto customers' PCs [1598]. The company had recently changed its quality assurance process, which contributed to the buggy update being released and causing chaos for many customers [1598]. Additionally, McAfee mentioned that they would be adding new QA steps to address updates that directly affect crucial Windows system files to prevent such incidents in the future [1598]. (b) The software failure incident was also a result of accidental decisions or mistakes made by McAfee. The update released by McAfee early in the day inadvertently turned the antivirus software's defenses against a vital component of Microsoft Windows, causing computers to crash or repeatedly reboot [1406]. McAfee apologized for the problem and downplayed its impact, stating that they were not aware of significant impact on consumers [1406]. The company faced criticism for its initial recommendation to users to download a file from a support site, which led to further issues as the site went offline and returned an error message due to the influx of irate users [1406].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident with McAfee's antivirus update was primarily attributed to development incompetence. McAfee acknowledged that the buggy update was a result of poor testing due to recent changes in their quality assurance process, which allowed the faulty DAT file to pass testing and reach customers' PCs [1598]. The update misidentified a key Windows file as a virus, causing widespread computer crashes and reboots, impacting various organizations and individuals globally [1598]. (b) Additionally, the incident can also be categorized as accidental, as McAfee released the faulty update early in the day, causing the software's defenses to attack a vital component of Microsoft Windows unintentionally. McAfee apologized for the problem, stating they were not aware of significant impact on consumers, downplaying the severity of the issue [1406]. The update effectively redirected the PC's immune system to attack a legitimate operating system component, SVCHOST.EXE, due to a misidentification with malware, leading to a day of disruptions and complaints from affected users [1406].
Duration temporary From the provided articles [1598, 1406], the software failure incident involving McAfee's buggy antivirus update causing computers to crash or repeatedly reboot was a temporary failure. The incident was temporary because it was caused by a specific buggy update that was released at 6 a.m. PT on a particular day, affecting a significant number of Windows XP computers running Service Pack 3. McAfee acknowledged the issue, halted distribution of the update, and provided a fix by midday. The company also worked on a patch to address the false positive identification of a legitimate Windows file as a virus. Additionally, McAfee continued to work on an automated solution to resolve the issue, indicating that the failure was not permanent but rather a result of specific circumstances related to the faulty update.
Behaviour crash, value, other (a) crash: The software failure incident described in the articles resulted in crashes of computers running Windows XP with Service Pack 3. The faulty update misidentified a key Windows file called svchost.exe as a virus, causing PCs to crash or keep rebooting [1598]. The update released by McAfee redirected the PC's immune system, causing it to attack a legitimate operating system component known as SVCHOST.EXE, similar to how some diseases can cause the human immune system to turn inward [1406]. (b) omission: The software failure incident did not specifically mention any instances of the system omitting to perform its intended functions at an instance(s). (c) timing: The software failure incident did not involve the system performing its intended functions correctly, but too late or too early. (d) value: The software failure incident involved the system performing its intended functions incorrectly. The faulty update misidentified a key Windows file as a virus, leading to crashes and reboots of PCs [1598]. The update caused the software's defenses to attack a vital component of Microsoft Windows, SVCHOST.EXE, incorrectly confusing it with malware [1406]. (e) byzantine: The software failure incident did not exhibit the system behaving erroneously with inconsistent responses and interactions. (f) other: The software failure incident involved a failure due to poor testing, which allowed the buggy DAT file to get past the test environment and onto the PCs of customers [1598]. Additionally, the incident led to widespread impacts, such as crashing computers, disrupted operations in various organizations, and the need for manual repairs by system administrators [1406].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, theoretical_consequence The consequence of the software failure incident described in the articles is primarily categorized under the following options: (d) property: The software failure incident caused significant property damage and disruption. For example, the faulty antivirus update from McAfee misidentified a key Windows file as a virus, leading to PCs crashing or repeatedly rebooting, impacting various organizations and individuals [1598, 1406]. (e) delay: The software failure incident caused delays in operations and activities for affected organizations and individuals. For instance, police in Lexington, Ky., had to resort to hand-writing reports and turn off their patrol car terminals, jails canceled visitation, and hospitals had to turn away non-trauma patients and postpone elective surgeries [1406]. (h) theoretical_consequence: While there were no reports of actual deaths or physical harm caused by the software failure incident, there were potential consequences discussed, such as the disruption in critical services like healthcare and law enforcement, which could have potentially led to harm or more severe consequences [1598, 1406].
Domain information, health (a) The failed system was intended to support the information industry. The McAfee antivirus software failure incident affected various organizations and institutions heavily reliant on computers and information systems, such as chipmaker Intel, Rhode Island hospitals, Kentucky police, University of Michigan's medical school, and an Australian supermarket chain [1598, 1406].

Sources

Back to List