Incident: Hardware Failure in Computer Systems Causing Memory-Related Crashes

Published Date: 2012-08-30

Postmortem Analysis
Timeline 1. The incident happened approximately two years before the article was published [13690]. 2. The article was published on 2012-08-30. 3. The incident therefore occurred around August 2010.
System 1. Dynamic random-access memory (DRAM) [13690] 2. Memory module on Stephen Jakisa's PC [13690]
Responsible Organization None identified; the failures are attributed to hardware rather than to an organization: 1. Bad memory chip on Jakisa's PC [13690] 2. Hardware failures in computer systems [13690] 3. DRAM memory failures in Google's data centers [13690] 4. Hard errors in memory chips used by IBM Blue Gene Systems and SciNet [13690]
Impacted Organization 1. Stephen Jakisa [13690] 2. Google [13690] 3. IBM Blue Gene Systems [13690] 4. SciNet [13690]
Software Causes 1. None identified; buggy software was initially suspected, but the crashes were traced to a bad memory chip, which is a hardware fault [13690]
Non-software Causes 1. Hardware failure due to a bad memory chip on the PC [13690] 2. Hardware defects such as old age or buggy manufacturing in dynamic random-access memory (DRAM) [13690] 3. Hard errors in memory chips used by Google's custom-designed Linux systems, IBM Blue Gene Systems, and the Canadian supercomputer SciNet [13690]
Impacts 1. Stephen Jakisa experienced serious computer problems while playing a game and using his web browser, including frequent crashes and an inability to install software on his PC [13690]. 2. The incident highlighted that hardware bugs, specifically bad memory chips, are a significant factor in computer failures, challenging the common assumption that software is solely to blame for crashes [13690]. 3. Researchers found that hard errors in computer hardware, such as faulty DRAM memory chips, were more common than soft errors, such as those caused by cosmic rays, leading to unexpected failures in systems like Google's data centers and IBM Blue Gene Systems [13690]. 4. Error handling for hard errors in high-end chips lags behind the existing solutions for soft errors. This gap was identified as causing more problems than commonly recognized, especially in consumer-grade devices that lack error-correcting code [13690].
Preventions 1. Conducting regular hardware maintenance and checks to detect faulty components like memory chips [13690]. 2. Implementing error-correcting code in high-end chips to handle hard errors more effectively [13690]. 3. Enhancing hardware reliability testing processes before shipping products to ensure long-term accuracy [13690].
Fixes 1. Identifying and replacing the faulty memory chip on the PC could resolve the crashes [13690].
References 1. Stephen Jakisa 2. Ioan Stefanovici 3. University of Toronto 4. Google 5. IBM Blue Gene Systems 6. SciNet 7. AMD 8. Vilas Sridharan 9. Samsung
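The Preventions entry above names error-correcting code as a mitigation. As a rough illustration of the idea, the sketch below implements the textbook Hamming(7,4) code, which stores 4 data bits with 3 parity bits and can correct any single flipped bit. This is an assumption chosen for brevity: the article does not say which code any vendor uses, and ECC memory typically protects much wider words. The function names are hypothetical.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (0..15) into a 7-bit codeword (illustrative only)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # parity over codeword positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # parity over codeword positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # parity over codeword positions 4, 5, 6, 7
    # Codeword positions 1..7 hold: p1 p2 d0 p4 d1 d2 d3
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]
    return sum(b << i for i, b in enumerate(bits))


def hamming74_decode(word):
    """Return (data, syndrome); syndrome is the 1-based position of the
    corrected bit, or 0 if no error was detected."""
    bits = [(word >> i) & 1 for i in range(7)]
    s1 = bits[0] ^ bits[2] ^ bits[4] ^ bits[6]
    s2 = bits[1] ^ bits[2] ^ bits[5] ^ bits[6]
    s4 = bits[3] ^ bits[4] ^ bits[5] ^ bits[6]
    syndrome = s1 | (s2 << 1) | (s4 << 2)
    if syndrome:
        bits[syndrome - 1] ^= 1  # flip the faulty bit back
    data = bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)
    return data, syndrome


# Simulate a single-bit memory fault in a stored value.
stored = hamming74_encode(0b1011)
corrupted = stored ^ (1 << 4)           # flip the bit at codeword position 5
recovered, position = hamming74_decode(corrupted)
```

A code like this corrects the read, but it does not repair the cell: with a hard error the same position keeps failing, which is why the Fixes entry still calls for replacing the chip.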

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) In the articles, it is mentioned that University of Toronto professor Bianca Schroeder and her team found that Google's custom-designed Linux systems experienced more errors than expected, with about eight percent of Google's memory chips responsible for 90 percent of the problems. The issues were concentrated on specific regions of the computer's memory and tended to occur in older machines. This indicates a recurring software failure incident within Google's infrastructure [13690]. (b) The research conducted by University of Toronto professor Bianca Schroeder and her team also found similar results on memory chips used by IBM Blue Gene Systems and a Canadian supercomputer called SciNet. Additionally, a paper by researchers at AMD highlighted that hard errors were more common than soft errors in DRAM memory chips. These findings suggest that the software failure incidents related to memory chip failures are not isolated to a single organization but are prevalent across different organizations and systems [13690].
Phase (Design/Operation) operation (a) The articles do not provide specific information about a software failure incident related to the design phase, where the failure is due to contributing factors introduced by system development, system updates, or procedures to operate or maintain the system. (b) The articles do mention a software failure incident related to the operation phase, where the failure is due to contributing factors introduced by the operation or misuse of the system. In the incident described in the article, Stephen Jakisa experienced computer problems while playing a game and using his web browser. The issues escalated to the point where he couldn't even install software on his PC. It was later discovered that the root cause of the problem was a bad memory chip on his PC, which was leading to the system failures [13690].
Boundary (Internal/External) within_system (a) within_system: The software failure incident discussed in the articles is primarily attributed to hardware issues within the system. Specifically, the incident was caused by a bad memory chip on the user's PC, leading to various problems such as computer crashes, browser malfunctions, and the inability to install software [13690]. The articles highlight the importance of considering hardware failures, such as memory chip defects, as a significant factor in software failures, challenging the common assumption that software bugs are the primary cause of system issues.
Nature (Human/Non-human) non-human_actions (a) The software failure incident occurring due to non-human actions: - The incident described in the articles is related to hardware failures, specifically bad memory chips causing computer problems [13690]. - The articles discuss how hardware bugs, such as soft errors and hard errors, can lead to software failures. Soft errors are caused by factors like cosmic rays affecting microprocessor transistors, while hard errors can result from issues like heat or manufacturing defects causing components to wear out over time [13690]. - Research conducted at the University of Toronto and by AMD found that hard errors in DRAM memory chips were more common than soft errors, indicating that non-human factors like old age or buggy manufacturing can contribute to software failures [13690]. (b) The software failure incident occurring due to human actions: - The articles do not provide any information indicating that the software failure incident was caused by contributing factors introduced by human actions.
Dimension (Hardware/Software) hardware (a) The articles discuss failure incidents that originate in hardware. For example, in the incident involving Stephen Jakisa, his computer problems were traced back to a bad memory chip on his PC, causing issues with gaming, web browsing, and software installation [13690]. Additionally, researchers at the University of Toronto found that failures in computer memory chips were more likely due to old age or manufacturing defects (hard errors) than to soft errors caused by cosmic rays [13690]. (b) The articles do not provide specific examples of failures originating solely in software. The incident with Stephen Jakisa was initially suspected to be caused by buggy software, but it was ultimately determined to be a hardware issue with the memory chip [13690].
Objective (Malicious/Non-malicious) non-malicious (a) The articles do not mention any software failure incident caused by malicious intent. (b) The software failure incidents discussed in the articles are non-malicious in nature. They are attributed to hardware issues such as bad memory chips causing computer crashes and errors in DRAM memory chips leading to system failures [13690]. These failures are not intentional but rather a result of hardware defects or aging components.
Intent (Poor/Accidental Decisions) accidental_decisions (a) The articles do not provide information about a software failure incident related to poor decisions. (b) The software failure incident related to accidental decisions is described in the article. The incident involved a computer problem experienced by Stephen Jakisa while playing a game and using his web browser. Initially, he suspected a virus or buggy software but later discovered that the issue was caused by a bad memory chip on his PC. This incident was accidental and not due to poor decisions [13690].
Capability (Incompetence/Accidental) accidental (a) The articles do not provide information about the software failure incident occurring due to development incompetence. (b) The software failure incident mentioned in the articles was accidental: it was traced back to a bad memory chip on the PC, which caused issues with the computer's performance [13690].
Duration permanent (a) The articles discuss hardware failures, particularly related to memory chips, which can lead to software failures. For example, in the case of Stephen Jakisa, his computer problems were traced back to a bad memory chip [13690]. The articles also mention that hard errors in DRAM memory chips are more common than soft errors, and these hard errors can cause significant problems [13690]. (b) The articles also mention that soft errors, caused by factors like cosmic rays, can lead to software failures. However, the focus is more on hard errors caused by old age or manufacturing defects in memory chips, which can result in more frequent and severe failures [13690].
Behaviour crash (a) crash: The article mentions a software failure incident where a computer crashed, indicated by a blue screen, possibly due to bad memory in the computer's video card [13690]. (b) omission: The article does not specifically mention a software failure incident related to omission. (c) timing: The article does not specifically mention a software failure incident related to timing. (d) value: The article does not specifically mention a software failure incident related to value. (e) byzantine: The article does not specifically mention a software failure incident related to Byzantine behavior. (f) other: The article does not provide information on any other specific behavior of software failure incidents.
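The Nature and Duration rows above distinguish hard errors, which come from worn or defective cells, from transient soft errors, and the Recurring row notes that problems were concentrated in specific memory regions. One way to picture the distinction is to count how often each address appears in an error log: an address that fails repeatedly is more likely a hard error. The sketch below is a heuristic under stated assumptions, not a method taken from the article; the function name, the threshold, and the sample log are all hypothetical.

```python
from collections import Counter


def classify_errors(error_addresses, repeat_threshold=2):
    """Split logged error addresses into likely-hard and likely-soft.

    Assumption: an address seen at least `repeat_threshold` times is treated
    as a likely hard error; an address seen fewer times as a likely soft one.
    """
    counts = Counter(error_addresses)
    likely_hard = sorted(a for a, n in counts.items() if n >= repeat_threshold)
    likely_soft = sorted(a for a, n in counts.items() if n < repeat_threshold)
    return likely_hard, likely_soft


# Made-up log of addresses where memory errors were observed.
log = [0x1000, 0x2A40, 0x1000, 0x1000, 0x7F10]
hard, soft = classify_errors(log)
```

With this made-up log, `0x1000` lands in the likely-hard list because it repeats, while the two addresses seen once are treated as likely soft.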

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence no_consequence (a) death: There is no mention of any deaths resulting from the software failure incident in the provided article [13690].
Domain information, entertainment (a) The failed system was related to the information industry as it involved a computer system failure experienced by a programmer, Stephen Jakisa, while playing a video game and using a web browser [13690]. The incident highlighted the importance of hardware reliability in computer systems, specifically focusing on memory chip failures rather than software bugs.

Sources
