Recurring |
unknown |
The articles do not state whether a similar software failure incident recurred at the same organization or occurred at other organizations, so the answer is 'unknown'. |
Phase (Design/Operation) |
design, operation |
(a) The design-phase aspect of the incident appears in the article's account of supercomputers at Los Alamos National Laboratory suffering memory errors caused by cosmic-ray neutrons. The article notes that the early supercomputer Q crashed more often than expected, which led scientists to worry that cosmic rays were affecting the system [72205].
(b) The operation-phase aspect appears in the article's description of how the Los Alamos supercomputers handle neutron-induced errors at run time. When a system detects too many flipped bits, it intentionally crashes the computers to prevent further problems, which the article likens to falling down on purpose while skiing to avoid a worse outcome [72205]. |
Boundary (Internal/External) |
outside_system |
The contributing factors of the incident originate outside the system, specifically from cosmic-ray neutrons arriving from outer space. These neutrons can slam into processor parts, corrupting their data and causing memory errors in the supercomputers at Los Alamos National Laboratory [72205]. The lab's engineers have had to adapt their hardware and software to account for these external factors, for example by placing neutron detectors throughout the supercomputing center and performing cosmic stress-tests on new equipment to preempt potential problems [72205]. |
Nature (Human/Non-human) |
non-human_actions, human_actions |
(a) The software failure incident occurring due to non-human actions:
The article discusses how the Cray-1 supercomputer at Los Alamos National Laboratory experienced 152 unattributable memory errors during a six-month free trial period. It was later discovered that cosmic-ray neutrons from outer space could slam into processor parts, corrupting their data, leading to memory errors in the computer [72205].
(b) The software failure incident occurring due to human actions:
The article mentions that before Los Alamos installs new equipment, like its Trinity machine, engineers perform a cosmic stress-test by placing the electronics in a beam of neutrons to observe their behavior. This proactive testing is done by humans to preempt potential problems caused by cosmic rays on the supercomputers [72205]. |
Dimension (Hardware/Software) |
hardware |
(a) The software failure incident occurring due to hardware:
The articles attribute the incident to a hardware-level cause: cosmic-ray neutrons slamming into processor parts and corrupting their data. This led to 152 unattributable memory errors in the Cray-1 supercomputer during a free trial at Los Alamos National Laboratory [72205].
(b) The software failure incident occurring due to software:
The articles do not provide specific information about a software failure incident directly caused by software-related factors. |
Objective (Malicious/Non-malicious) |
non-malicious |
(a) The articles do not provide information about a software failure incident related to a malicious attack or intent to harm the system. Therefore, there is no evidence of a malicious software failure incident in the provided articles [72205].
(b) The articles discuss software failure incidents related to non-malicious factors, specifically cosmic-ray neutrons causing memory errors in supercomputers at Los Alamos National Laboratory. These incidents were not intentional but were a result of external factors affecting the hardware and software of the supercomputers [72205]. |
Intent (Poor/Accidental Decisions) |
unknown |
(a) The intent of the software failure incident related to poor_decisions:
- The article discusses how Los Alamos National Laboratory had to adapt its hardware and software to account for cosmic-ray neutrons after experiencing memory errors in its supercomputers [72205].
- It mentions that the lab's engineers understood the impact of neutrons on their equipment after the Q supercomputer crashed more than expected, leading them to perform cosmic stress-tests before installing new equipment like the Trinity machine [72205].
- The engineers intentionally crash the computers when errors occur, similar to falling down on purpose while skiing to avoid a worse outcome [72205].
(b) The intent of the software failure incident related to accidental_decisions:
- The article highlights how cosmic-ray neutrons can corrupt processor parts and cause memory errors in supercomputers; these errors were unattributable until researchers discovered the impact of cosmic rays [72205].
- It mentions that neutron-induced silent data corruption can occur when bits flip without being noticed, emphasizing the importance of preemptive work to detect and address such issues [72205].
- The article discusses the need for human intervention to verify results before providing answers, indicating a cautious approach to ensure accuracy in critical research areas [72205]. |
Capability (Incompetence/Accidental) |
accidental |
(a) The articles do not provide information about a software failure incident related to development incompetence.
(b) The articles mention a software failure incident related to accidental factors. Specifically, the incident involves cosmic-ray neutrons causing memory errors in supercomputers at Los Alamos National Laboratory. These errors are attributed to the impact of cosmic rays on processor parts, corrupting their data [72205]. |
Duration |
unknown |
The articles do not provide specific information about a software failure incident being permanent or temporary. |
Behaviour |
crash, other |
(a) crash: The articles mention intentional crashes of the supercomputers at Los Alamos National Laboratory. When the system detects too many flipped bits, it crashes the computers on purpose, which the article likens to falling down deliberately when skiing to prevent a worse outcome [72205].
(b) omission: There is no specific mention of a software failure incident related to omission in the provided articles.
(c) timing: The articles do not discuss a software failure incident related to timing issues.
(d) value: The articles do not provide information about a software failure incident related to the system performing its intended functions incorrectly.
(e) byzantine: The articles do not describe a software failure incident related to the system behaving erroneously with inconsistent responses and interactions.
(f) other: The articles mention "silent data corruption," in which bits flip due to cosmic-ray neutrons but no one notices the error. This could be considered a form of failure in which the system does not behave as expected yet neither crashes overtly nor flags an error [72205]. |