Incident: Microsoft Excel Data Limit Leads to Covid Test Result Misplacement.

Published Date: 2020-10-05

Postmortem Analysis
Timeline 1. The software failure occurred between 25 September and 2 October 2020 [106461], so it began in September 2020 and ran into early October.
System 1. Microsoft Excel software [106096, 106461] 2. Public Health England's automatic process using an old file format (XLS) [106461]
Responsible Organization 1. Public Health England (PHE) [106096, 106461] 2. NHS Test and Trace [106461]
Impacted Organization 1. Public Health England (PHE) [106096, 106461] 2. NHS Test and Trace [106461]
Software Causes 1. A row limit in Microsoft Excel caused Public Health England to misplace nearly 16,000 Covid test results; one article describes it as Excel's one-million-row limit [106096, 106461]. 2. More specifically, Public Health England developers picked an old file format (XLS) to handle the data, which limited each template to about 65,000 rows rather than the one million-plus rows that Excel is capable of, and cases beyond that limit were left off [106461].
Non-software Causes 1. Some Microsoft Excel data files exceeded the maximum size after they were sent from NHS Test and Trace to Public Health England [106461]. 2. The file format was a development decision: Public Health England (PHE) developers picked an old file format (XLS) to bring together logs produced by commercial firms, which limited each template to about 65,000 rows of data rather than the one million-plus rows that Excel is capable of handling [106461].
Impacts
1. Nearly 16,000 Covid-19 cases went unreported in England, delaying contact tracing and the isolation of infected individuals [106096, 106461].
2. Around 50,000 potentially infectious people may have been missed by contact tracers and not told to self-isolate because of the missing test results [106096].
3. Contact tracers were late in reaching the close contacts of those who tested positive; by Monday afternoon, only 51% of affected individuals had been reached [106461].
4. The daily case figures on the government's coronavirus dashboard were inaccurate, showing fewer cases than had actually occurred [106461].
5. Once the missing tests were accounted for, reported cases in the North West of England rose by 92.6% [106461].
6. The failure affected the introduction of local restrictions, with concerns raised about the accuracy of the data used for decision-making [106461].
Preventions
1. Handling the data with automated processes, rather than manual input and Excel spreadsheets, could have prevented the incident [106096, 106461].
2. Using a modern file format that supports more rows, such as XLSX, instead of the outdated XLS format could have avoided the row limit on the Excel templates [106461].
3. Thorough testing and validation of the data handling processes could have identified the limit before it caused a critical error [106096, 106461].
4. Training and guidelines on proper spreadsheet usage could have prevented common errors and protected data integrity [106096].
5. Stricter controls on data entry and validation could have reduced the risk of errors caused by manual input [106096].
Fixes
1. Implement automated processes: automated data handling and reporting would avoid manual errors and the limits of software such as Microsoft Excel [106096, 106461].
2. Upgrade software versions and file formats: moving to versions and formats with higher row limits, such as XLSX rather than XLS, would prevent data truncation and loss due to size limits [106096, 106461].
3. Conduct thorough testing: before any data handling system is deployed, it should be tested against the expected volume of data to confirm that nothing is truncated or lost [106461].
4. Provide training and guidelines: educating users on software tools such as Excel, and on data handling best practices, would help prevent errors caused by misuse or by not understanding the software's limits [106096].
References 1. Public Health England (PHE) [106096, 106461] 2. NHS Test and Trace [106461] 3. Health Secretary Matt Hancock [106461] 4. Labour's shadow health secretary Jonathan Ashworth [106461] 5. BBC [106461]
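The row-limit mechanism described under Software Causes can be illustrated with a small sketch. It is not PHE's code: the function names are hypothetical, the data is synthetic, and the exact limits (65,536 rows for XLS, 1,048,576 for XLSX) are Excel's documented worksheet limits, assumed here because the articles only say "about 65,000" and "one million-plus". The first function mimics a loader that silently keeps only the rows that fit; the second is a guard of the kind suggested under Preventions, which fails loudly instead.

```python
from typing import List, Sequence, Tuple

# Assumed values: Excel's documented worksheet row limits.
# The articles only say "about 65,000" and "one million-plus".
XLS_MAX_ROWS = 65_536       # 2**16, legacy .xls format
XLSX_MAX_ROWS = 1_048_576   # 2**20, .xlsx format


def load_with_silent_cap(rows: Sequence[str], max_rows: int) -> Tuple[List[str], int]:
    """Hypothetical loader: keep the first max_rows rows and drop the rest."""
    kept = list(rows[:max_rows])
    dropped = len(rows) - len(kept)
    return kept, dropped


def load_or_fail(rows: Sequence[str], max_rows: int) -> List[str]:
    """Guarded variant: refuse to load rather than lose rows silently."""
    if len(rows) > max_rows:
        raise ValueError(
            f"{len(rows)} rows exceed the {max_rows}-row limit; "
            f"{len(rows) - max_rows} rows would be lost"
        )
    return list(rows)


if __name__ == "__main__":
    # Synthetic records, not real test results.
    records = [f"result-{i}" for i in range(70_000)]
    kept, dropped = load_with_silent_cap(records, XLS_MAX_ROWS)
    print(len(kept), dropped)  # 65536 4464
```

With 70,000 synthetic rows, the capped loader keeps 65,536 and drops 4,464 without reporting anything, while the guarded loader raises an error for the XLS limit and accepts the same input under the XLSX limit.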

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) Spreadsheet errors have caused serious failures at other organizations, although the cited example is a formula mistake rather than a row limit. In 2013, an Excel error at JPMorgan led to the loss of almost $6bn when a cell mistakenly divided by the sum of two interest rates instead of their average [106096]. (b) Within the same organization, the rationale cites only the present incident: some Microsoft Excel data files sent from NHS Test and Trace to Public Health England exceeded the maximum size, and nearly 16,000 Covid-19 cases went unreported in England [106461]. It cites no earlier occurrence of the same fault.
Phase (Design/Operation) design, operation (a) The software failure incident in the articles was primarily due to design factors introduced during system development. The incident occurred because Public Health England (PHE) used an old file format (XLS) to bring together logs produced by commercial firms for swab tests into Excel templates. This design choice limited each template to about 65,000 rows of data instead of the one million-plus rows Excel is capable of handling. As a result, when the total number of cases reached the limit, further cases were simply left off, leading to nearly 16,000 Covid-19 cases being unreported [106461]. (b) The software failure incident was also influenced by operational factors related to the operation of the system. The technical glitch that caused the error in reporting nearly 16,000 Covid-19 cases was a result of some Microsoft Excel data files exceeding the maximum size after being sent from NHS Test and Trace to Public Health England. This operational issue led to the exclusion of 15,841 cases from the UK daily case figures between 25 September and 2 October. The error was discovered overnight on a Friday, and although it was fixed, by Monday afternoon, only 51% of those affected had been reached by contact tracers [106461].
Boundary (Internal/External) within_system (a) The software failure incident related to the Excel data error leading to nearly 16,000 Covid test results being misplaced was primarily within the system. The incident was caused by the way Public Health England (PHE) developers picked an old file format (XLS) to handle data from commercial firms carrying out swab tests, limiting each template to about 65,000 rows of data instead of the one million-plus rows Excel is capable of handling [106461]. This internal decision within PHE's system led to the data error and subsequent failure in reporting the Covid-19 cases accurately.
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident in the articles was primarily due to non-human actions. Specifically, the incident was caused by a technical glitch related to the way Microsoft Excel data files were handled, leading to nearly 16,000 Covid-19 cases going unreported in England [106096, 106461]. The error occurred because the files exceeded the maximum size allowed in Excel, resulting in cases being left out of the daily case figures. This limitation was due to the file format chosen by Public Health England (PHE) developers, which could only handle about 65,000 rows of data instead of the one million-plus rows that Excel is capable of processing. (b) While the software failure incident was primarily due to non-human actions as described above, human actions also played a role in the incident. The article mentions that the PHE developers picked an old file format (XLS) to handle the data, which contributed to the limitation in the number of rows that could be processed [106461]. Additionally, the incident highlighted the importance of proper data handling and processing procedures, which involve human decisions and actions in setting up the automated processes for data aggregation and reporting.
Dimension (Hardware/Software) hardware, software (a) hardware: The evidence cited under this option does not point to a hardware fault. It states that some Microsoft Excel data files exceeded the maximum size after they were sent from NHS Test and Trace to Public Health England, and that the agency brought together logs produced by commercial firms into Excel templates with a limit on the number of rows they could hold [106461]. Both are limits of the file format rather than of any device. (b) software: Data files exceeded the maximum size that the Excel format could hold, so cases were left out of the UK daily case figures [106461]. PHE developers had picked an old file format (XLS) to bring the data together in Excel templates, which limited the number of rows each template could hold [106461]. Microsoft Excel's million-row limit was also reported as a contributing factor in the misplacement of nearly 16,000 Covid test results by Public Health England [106096].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident related to the Excel data error in handling Covid-19 test results was non-malicious. The incident was caused by technical glitches and errors in the way Microsoft Excel was used to process and store the data, leading to the misreporting of nearly 16,000 Covid-19 cases [106096, 106461]. The error was attributed to the limitations of Excel in handling large datasets, specifically the million-row limit in Excel spreadsheets, which resulted in data being cut off and not displayed correctly. Additionally, the use of an old file format (XLS) by Public Health England (PHE) developers further exacerbated the issue by limiting the number of rows that could be handled by each template, causing cases to be left out [106461]. (b) The incident was not a result of malicious intent but rather a consequence of technical limitations and errors in the handling of data files using Microsoft Excel. There is no indication in the articles that the failure was caused by any deliberate actions to harm the system or manipulate the data for malicious purposes.
Intent (Poor/Accidental Decisions) poor_decisions (a) The software failure incident related to the Excel data error leading to nearly 16,000 Covid test results being misplaced by Public Health England can be attributed to poor decisions. The incident occurred due to the use of an old file format (XLS) by PHE developers to handle data in Excel templates, limiting each template to about 65,000 rows of data instead of the one million-plus rows that Excel is capable of handling [106461]. This poor decision in selecting the file format ultimately led to the truncation of data and the omission of thousands of test results from the official daily figures, impacting contact tracing efforts and potentially putting lives at risk.
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident related to development incompetence is evident in the articles. The incident where nearly 16,000 Covid test results were misplaced by Public Health England was caused by a million-row limit on Microsoft Excel, leading to positive tests being left off the official daily figures [106096, 106461]. This issue arose due to the manual handling of data and the use of an old Excel file format that limited the number of rows that could be handled, causing crucial data to be omitted. The incident highlights the consequences of not adapting systems to handle the increasing volume of data efficiently, showcasing a lack of professional competence in managing data effectively. (b) The software failure incident can also be attributed to accidental factors. The technical glitch that caused the error in reporting nearly 16,000 Covid-19 cases in England was described as a "technical error" by Health Secretary Matt Hancock, emphasizing that it "should never have happened" [106461]. The error was caused by the way data files were handled in Microsoft Excel, exceeding the maximum size and leading to cases being left out of the daily case figures. This accidental oversight in handling the data and using an outdated file format inadvertently resulted in the failure to report crucial Covid test results, putting lives at risk and impacting the government's assessment of the epidemic.
Duration permanent (a) temporary: The incident itself, in which Excel data files exceeded the maximum size and nearly 16,000 Covid-19 cases went unreported in England, can be considered a temporary failure. It was caused by the way Public Health England (PHE) brought together logs produced by commercial firms into Excel templates using an old file format (XLS), which limited each template to about 65,000 rows of data instead of the one million-plus rows Excel is capable of handling. Once a template reached its limit, further cases were simply left off [106461]. (b) permanent: The limits of Microsoft Excel in handling large datasets, such as the million-row limit, are a permanent contributing factor. When a lab's daily test report in CSV format was loaded into Excel, the bottom rows were cut off once the file exceeded Excel's row limit, which shows how the software's inherent limits can cause failures in data processing and reporting [106096].
Behaviour crash, omission, value, other
(a) crash: The software failure incident in the articles can be categorized as a crash. The incident involving Microsoft Excel led to a significant data error where positive Covid-19 test results were left off the official daily figures due to a million-row limit in Excel. This crash resulted in potentially infectious people not being informed to self-isolate, indicating a failure of the system to perform its intended functions [106096, 106461].
(b) omission: The software failure incident can also be categorized as an omission. The error in the Excel files caused nearly 16,000 Covid-19 cases to go unreported in England, leading to the omission of these cases from the UK daily case figures. This omission meant that close contacts of those who tested positive were not traced, putting lives at risk [106461].
(c) timing: The software failure incident does not align with a timing failure. The issue was not related to the system performing its intended functions too late or too early; rather, it was a result of the system failing to handle the data correctly due to a limitation in the software [106096, 106461].
(d) value: The software failure incident can be associated with a value failure. The incident involved the system incorrectly handling the data due to the limitation of the Excel software, which led to positive test results being left out of the official daily figures. This incorrect handling of data resulted in a significant impact on contact tracing and public health measures [106096, 106461].
(e) byzantine: The software failure incident does not align with a byzantine failure. There were no indications of inconsistent responses or interactions from the system; instead, the failure was primarily due to the system reaching its limitations and not being able to process the data correctly [106096, 106461].
(f) other: The other behavior of the software failure incident could be described as a limitation failure. The incident was caused by the limitation of the Excel software in handling a large volume of data, leading to crucial Covid-19 test results being omitted from official reports. This limitation in the software's capacity resulted in a significant impact on public health data reporting and contact tracing efforts [106096, 106461].
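The omission behaviour classified above is the kind of fault that a row-count reconciliation would surface. The sketch below is a hypothetical illustration, not the fix PHE applied and not something described in the articles: it splits a record set into batches that each fit under an assumed per-file row limit, then checks that the number of rows written equals the number received. The function names and the 65,536-row figure (Excel's documented XLS limit) are assumptions.

```python
from typing import Iterator, List, Sequence


def split_into_batches(rows: Sequence[str], max_rows: int) -> Iterator[List[str]]:
    """Yield consecutive batches, none longer than max_rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    for start in range(0, len(rows), max_rows):
        yield list(rows[start:start + max_rows])


def reconcile(source_count: int, batches: Sequence[Sequence[str]]) -> None:
    """Raise if the batches do not account for every source row."""
    written = sum(len(batch) for batch in batches)
    if written != source_count:
        raise RuntimeError(
            f"row count mismatch: source={source_count}, written={written}"
        )


if __name__ == "__main__":
    # Synthetic records; 65_536 is an assumed per-file limit.
    records = [f"result-{i}" for i in range(150_000)]
    batches = list(split_into_batches(records, 65_536))
    reconcile(len(records), batches)  # passes: every row is accounted for
    print([len(batch) for batch in batches])  # [65536, 65536, 18928]
```

The reconciliation step is deliberately separate from the splitting step, so the same check could be run against any downstream store whose row count can be queried.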

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, non-human, other
(a) death: People lost their lives due to the software failure - No information in the articles suggests that people lost their lives due to the software failure incident. [106096, 106461]
(b) harm: People were physically harmed due to the software failure - There is no mention of people being physically harmed due to the software failure incident. [106096, 106461]
(c) basic: People's access to food or shelter was impacted because of the software failure - The software failure incident did not impact people's access to food or shelter. [106096, 106461]
(d) property: People's material goods, money, or data was impacted due to the software failure - The software failure incident led to nearly 16,000 Covid-19 cases going unreported, affecting contact tracing efforts and potentially exposing thousands to the virus. It also resulted in incorrect daily case figures being reported, impacting decision-making and public health measures. [106096, 106461]
(e) delay: People had to postpone an activity due to the software failure - The software failure incident did not directly result in people having to postpone an activity. [106096, 106461]
(f) non-human: Non-human entities were impacted due to the software failure - The software failure incident impacted the accuracy of Covid-19 case reporting and contact tracing efforts, affecting data integrity and public health measures. [106096, 106461]
(g) no_consequence: There were no real observed consequences of the software failure - The software failure incident had significant consequences, including the misreporting of Covid-19 cases, delays in contact tracing, and potential spread of the virus due to missed notifications. [106096, 106461]
(h) theoretical_consequence: There were potential consequences discussed of the software failure that did not occur - The articles do not mention any potential consequences discussed that did not occur. [106096, 106461]
(i) other: Was there consequence(s) of the software failure not described in the (a to h) options? What is the other consequence(s)? - The software failure incident resulted in inaccurate reporting of Covid-19 cases, potentially leading to underestimation of the true number of cases and impacting decision-making regarding public health measures and restrictions. It also highlighted the risks associated with manual data processing and the limitations of using Excel for critical data management tasks. [106096, 106461]
Domain health The software failure incident reported in the news articles is related to the health industry. The incident involved a technical glitch in Microsoft Excel that led to nearly 16,000 Covid-19 cases going unreported in England, impacting the tracking of positive cases and their close contacts, potentially putting lives at risk [106096, 106461]. The system affected by the failure was intended to support the health industry by managing and reporting Covid-19 test results and facilitating contact tracing efforts.
