Incident: Cosmic Ray Neutrons Causing Supercomputer Failures at Los Alamos

Published Date: 2018-06-04

Postmortem Analysis
Timeline 1. The earliest failures described in the article occurred in the 1970s, when the Cray-1 supercomputer experienced memory errors during a trial at Los Alamos National Laboratory [72205]. 2. The Q supercomputer, installed in 2003, later crashed more often than expected [72205]. 3. The article was published on 2018-06-04.
System 1. Cray-1 supercomputer 2. Q supercomputer 3. Trinity machine
Responsible Organization 1. None identified. The memory errors in the supercomputers at Los Alamos National Laboratory were caused by cosmic-ray neutrons, a natural phenomenon, rather than by an organization [72205]
Impacted Organization 1. Los Alamos National Laboratory [72205]
Software Causes 1. None identified. The failures originated in hardware: cosmic-ray neutrons slammed into processor parts and corrupted their data, which surfaced as memory errors and crashes in the supercomputers at Los Alamos National Laboratory [72205].
Non-software Causes 1. Cosmic-ray neutrons slamming into processor parts, corrupting their data [72205] 2. Neutrons causing computer memory to flip bits, leading to errors [72205] 3. Cosmic rays from outer space colliding with molecules in the atmosphere and breaking apart into smaller particles, including neutrons [72205]
Impacts 1. The Cray-1 supercomputer at Los Alamos National Laboratory logged 152 unattributable memory errors during its six-month trial [72205]. 2. The Q supercomputer, installed in 2003, crashed more often than expected, raising concerns that cosmic rays were affecting it [72205]. 3. Engineers at Los Alamos had to adapt their hardware and software to account for cosmic-ray neutrons that could corrupt computer data, which led to preemptive measures such as stress-testing new equipment like the Trinity machine [72205]. 4. The incident highlighted the problem of silent data corruption, in which bits flip without detection and can produce incorrect results [72205]. 5. To limit the damage, the systems at Los Alamos crash intentionally when too many errors are detected, similar to falling down on purpose while skiing to avoid a worse outcome [72205].
Preventions 1. Placing neutron detectors inside the supercomputing center to measure the strength of cosmic-ray neutron storms could have helped engineers anticipate the errors [72205]. 2. Performing a cosmic stress-test on new equipment before installation, by exposing the electronics to a beam of neutrons and observing their behavior, could have revealed the vulnerability before deployment [72205]. 3. Creating checkpoints throughout the computing process to save progress and enable recovery after an error could have mitigated the impact of the incident [72205].
Fixes 1. Placing neutron detectors throughout the supercomputing center to measure the strength of cosmic-ray neutron storms [72205]. 2. Performing cosmic stress-tests on new equipment by exposing the electronics to a beam of neutrons to preempt potential issues [72205]. 3. Creating checkpoints throughout the computing process so that a job can be restarted from saved progress after a crash [72205]. 4. Increasing human intervention to verify results and guard against silent data corruption [72205].
References 1. Los Alamos National Laboratory [72205] 2. High Performance Computing Design group [72205]
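
The checkpointing fix listed above can be illustrated with a minimal Python sketch. It is not the mechanism used at Los Alamos, which the article does not describe. The function names, the JSON file format, and the `fail_at` parameter that simulates a crash are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over path.

    The rename keeps the previous checkpoint intact if the process
    dies in the middle of writing the new one.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if not os.path.exists(path):
        return {"step": 0, "total": 0}
    with open(path) as f:
        return json.load(f)


def run(path, n_steps, every=10, fail_at=None):
    """Sum the integers 0..n_steps-1, checkpointing every `every` steps.

    fail_at (hypothetical, for illustration) raises an error at that
    step to stand in for a crash.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated crash")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

Calling `run` again with the same path after a crash resumes from the last saved step instead of from step 0, so only the work done since that checkpoint is repeated.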

Software Taxonomy of Faults

Category Option Rationale
Recurring unknown The articles do not provide specific information about a software failure incident happening again at the same organization or at multiple organizations. Therefore, the information to answer this question is 'unknown'.
Phase (Design/Operation) design, operation (a) The software failure incident related to the design phase can be seen in the article where it discusses the challenges faced by supercomputers at Los Alamos National Laboratory due to cosmic-ray neutrons causing memory errors in the processors. The article mentions incidents where the early supercomputer Q crashed more than expected, leading scientists to worry about cosmic rays affecting the system [72205]. (b) The software failure incident related to the operation phase is evident in the article's description of how supercomputers at Los Alamos National Laboratory handle errors caused by cosmic-ray neutrons. The article explains that when a system detects too many flipped bits due to neutron-induced errors, it intentionally crashes the computers to prevent further issues, similar to falling down on purpose while skiing to avoid a worse outcome [72205].
Boundary (Internal/External) outside_system The software failure incident described in the articles is related to the boundary of the system. The incident involves failures caused by contributing factors that originate from outside the system, specifically from cosmic-ray neutrons coming from outer space. These cosmic-ray neutrons can slam into processor parts, corrupting their data and causing memory errors in the supercomputers at Los Alamos National Laboratory [72205]. The lab's engineers have had to adapt their hard- and software to account for these external factors, such as placing neutron detectors throughout the supercomputing center and performing cosmic stress-tests on new equipment to preempt potential problems [72205].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurring due to non-human actions: The article discusses how the Cray-1 supercomputer at Los Alamos National Laboratory experienced 152 unattributable memory errors during a six-month free trial period. It was later discovered that cosmic-ray neutrons from outer space could slam into processor parts, corrupting their data, leading to memory errors in the computer [72205]. (b) The software failure incident occurring due to human actions: The article mentions that before Los Alamos installs new equipment, like its Trinity machine, engineers perform a cosmic stress-test by placing the electronics in a beam of neutrons to observe their behavior. This proactive testing is done by humans to preempt potential problems caused by cosmic rays on the supercomputers [72205].
Dimension (Hardware/Software) hardware (a) The software failure incident occurring due to hardware: The articles mention a software failure incident related to hardware issues caused by cosmic-ray neutrons slamming into processor parts, corrupting their data. This hardware-related issue led to 152 unattributable memory errors in the Cray-1 supercomputer during a free trial at Los Alamos National Laboratory [72205]. (b) The software failure incident occurring due to software: The articles do not provide specific information about a software failure incident directly caused by software-related factors.
Objective (Malicious/Non-malicious) non-malicious (a) The articles do not provide information about a software failure incident related to a malicious attack or intent to harm the system. Therefore, there is no evidence of a malicious software failure incident in the provided articles [72205]. (b) The articles discuss software failure incidents related to non-malicious factors, specifically cosmic-ray neutrons causing memory errors in supercomputers at Los Alamos National Laboratory. These incidents were not intentional but were a result of external factors affecting the hardware and software of the supercomputers [72205].
Intent (Poor/Accidental Decisions) unknown (a) The intent of the software failure incident related to poor_decisions: - The article discusses how the Los Alamos National Laboratory had to adapt to account for cosmic-ray neutrons in their hard- and software after experiencing memory errors in their supercomputers [72205]. - It mentions that the lab's engineers understood the impact of neutrons on their equipment after the Q supercomputer crashed more than expected, leading them to perform cosmic stress-tests before installing new equipment like the Trinity machine [72205]. - The engineers intentionally crash the computers when errors occur, similar to falling down on purpose while skiing to avoid a worse outcome [72205]. (b) The intent of the software failure incident related to accidental_decisions: - The article highlights how cosmic-ray neutrons can corrupt processor parts and cause memory errors in supercomputers, which was initially unattributable until researchers discovered the impact of cosmic rays [72205]. - It mentions that neutron-induced silent data corruption can occur when bits flip without being noticed, emphasizing the importance of preemptive work to detect and address such issues [72205]. - The article discusses the need for human intervention to verify results before providing answers, indicating a cautious approach to ensure accuracy in critical research areas [72205].
Capability (Incompetence/Accidental) accidental (a) The articles do not provide information about a software failure incident related to development incompetence. (b) The articles mention a software failure incident related to accidental factors. Specifically, the incident involves cosmic-ray neutrons causing memory errors in supercomputers at Los Alamos National Laboratory. These errors are attributed to the impact of cosmic rays on processor parts, corrupting their data [72205].
Duration unknown The articles do not provide specific information about a software failure incident being permanent or temporary.
Behaviour crash, other (a) crash: The articles mention intentional crashes of the supercomputers at Los Alamos National Laboratory. When the system detects too many flipped bits, it crashes the computers intentionally, which the article likens to falling down on purpose when skiing to prevent further issues [72205]. (b) omission: There is no specific mention of a software failure incident related to omission in the provided articles. (c) timing: The articles do not discuss a software failure incident related to timing issues. (d) value: The articles do not provide information about a software failure incident related to the system performing its intended functions incorrectly. (e) byzantine: The articles do not describe a software failure incident related to the system behaving erroneously with inconsistent responses and interactions. (f) other: The articles mention the concept of "silent data corruption," where bits flip due to cosmic-ray neutrons but no one notices the error. This could be considered a form of failure in which the system does not behave as expected yet neither crashes nor reports an error, so any incorrect result may go undetected [72205].
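
The intentional-crash behaviour in the Behaviour row can be sketched as a fail-stop check. The article does not say how the Los Alamos systems detect flipped bits. This sketch assumes one stored parity bit per word, and the names `scrub`, `TooManyBitFlips`, and `max_errors` are hypothetical. Parity catches only an odd number of flips in a word, and real machines typically use stronger error-correcting codes.

```python
def parity(word):
    """Return 1 if the integer word has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


class TooManyBitFlips(RuntimeError):
    """Raised to stop the computation deliberately (fail-stop)."""


def scrub(words, parities, max_errors):
    """Compare each word with its stored parity bit.

    Returns the indices of words whose parity no longer matches.
    Raises TooManyBitFlips if more than max_errors words are affected,
    on the assumption that stopping is safer than continuing with
    corrupted data.
    """
    bad = [i for i, (w, p) in enumerate(zip(words, parities)) if parity(w) != p]
    if len(bad) > max_errors:
        raise TooManyBitFlips(f"{len(bad)} corrupted words, limit is {max_errors}")
    return bad


# Usage: record parity when the data is written, check it later.
memory = [0b1010, 0b1111, 0b0001]
stored = [parity(w) for w in memory]
memory[0] ^= 1 << 2  # simulate a single neutron-induced bit flip
print(scrub(memory, stored, max_errors=1))  # index 0 is reported, no crash yet
```

A flip that goes undetected by a check like this is the silent data corruption case described in item (f).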

IoT System Layer

Layer Option Rationale
Perception processing_unit The software failure incident described in the articles relates to the processing_unit layer of the cyber-physical system, where the failure stemmed from processing errors. Cosmic-ray neutrons slammed into processor parts, corrupted their data, and caused memory errors in supercomputers such as the Cray-1 and Q at Los Alamos National Laboratory [72205]. The article highlights how cosmic rays, specifically neutrons, can cause computer memory to flip bits, leading to errors in data processing [72205]. Additionally, the engineers at Los Alamos perform cosmic stress-tests on new equipment like the Trinity machine by exposing the electronics to a beam of neutrons, so that potential cosmic-ray problems surface before installation [72205].
Communication unknown Unknown
Application FALSE The software failure incident described in the articles is not related to the application layer of the cyber physical system. The failure discussed in the articles is primarily attributed to cosmic-ray neutrons causing memory errors in supercomputers at Los Alamos National Laboratory, rather than issues related to bugs, operating system errors, unhandled exceptions, or incorrect usage typically associated with the application layer of a system. Therefore, the failure described in the articles does not align with the definition provided for application layer failures [72205].
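
To make the bit flips in the Perception row concrete, the following Python sketch shows how much a single flipped bit can change a stored number. It simulates the flip in software on an IEEE 754 double-precision value. The helper name `flip_bit` is hypothetical, and nothing here models the neutron physics.

```python
import struct


def flip_bit(value, bit):
    """Return the float64 obtained by flipping one bit of value.

    bit 0 is the least significant mantissa bit, bits 52-62 hold the
    exponent, and bit 63 is the sign.
    """
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


print(flip_bit(1.0, 0))   # lowest mantissa bit: a change of 2**-52
print(flip_bit(1.0, 52))  # lowest exponent bit: 0.5
print(flip_bit(1.0, 63))  # sign bit: -1.0
```

The effect depends entirely on which bit is hit: a mantissa flip may be lost in rounding noise, while an exponent or sign flip changes the result completely, and neither raises an error on its own.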

Other Details

Category Option Rationale
Consequence non-human, theoretical_consequence (a) unknown (b) unknown (c) unknown (d) unknown (e) unknown (f) The software failure incident described in the articles did not directly impact human lives or physical well-being. Instead, the incident affected the operation of supercomputers at Los Alamos National Laboratory due to cosmic-ray neutrons causing memory errors and data corruption [72205]. (g) no_consequence (h) theoretical_consequence (i) The articles discuss the potential consequences of cosmic-ray neutrons causing memory errors and silent data corruption in supercomputers, which could lead to incorrect results in simulations and research conducted by Los Alamos National Laboratory [72205].
Domain knowledge, government (a) The failed systems belong to the knowledge domain, specifically high-performance computing for research at Los Alamos National Laboratory. The supercomputers discussed in the article, such as the Cray-1 and the Q supercomputer, were used for calculations related to nuclear weapons and other scientific research, which also ties them to the government domain [72205].

Sources
