Incident: Memory Glitch Causes Curiosity Rover to Enter Safe Mode

Published Date: 2016-07-11

Postmortem Analysis
Timeline
1. The software failure on NASA's Curiosity rover occurred on March 6 [Article 82564].
2. The software mismatch incident occurred on July 2 [Article 46104].
3. The "hiccup during boot-up" incident occurred on the Friday before the article's publication on February 22 [Article 81464].
The dates of the software failure incidents are therefore:
- March 6, 2019
- July 2, 2016
- February 15, 2019
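The third date is derived rather than quoted: Article 81464 was published on February 22 and refers to the previous Friday. The calculation can be sketched as follows, assuming the year 2019 given in the timeline; the helper name `previous_friday` is illustrative and does not come from the articles.

```python
from datetime import date, timedelta

FRIDAY = 4  # date.weekday() numbers Monday as 0, so Friday is 4


def previous_friday(day: date) -> date:
    """Return the most recent Friday strictly before `day`."""
    days_back = (day.weekday() - FRIDAY) % 7
    if days_back == 0:
        # `day` is itself a Friday, so step back a full week.
        days_back = 7
    return day - timedelta(days=days_back)


# Publication date of Article 81464; the year is an assumption taken from the timeline.
published = date(2019, 2, 22)
print(previous_friday(published))  # 2019-02-15
```

February 22, 2019 was itself a Friday, so "the previous Friday" is read here as one full week earlier.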
System
1. Curiosity's Side-A computer [Article 82564]
2. Image data transfer mode in the rover's software [Article 46104]
3. Boot-up process of Curiosity [Article 81464]
Responsible Organization
1. The software failure incident on the Curiosity rover on Mars was caused by a "hiccup during boot-up" triggered by a glitch in the computer's memory [81464].
2. Another software failure incident on the Curiosity rover was caused by a "software mismatch" in the rover's software as it transferred images, leading the rover to enter safe mode [46104].
Impacted Organization
1. NASA [82564, 46104, 81464]
Software Causes
1. The software failure incident on the Curiosity rover was caused by a "software mismatch" in the rover's software as it transferred images [Article 46104].
2. The incident also involved a "hiccup during boot-up" that sent the rover into safe mode, interrupting its planned science activities [Article 81464].
Non-software Causes
1. Memory glitch related to the computer's memory [82564]
2. Software mismatch in how image data are transferred on board [46104]
Impacts
1. The software failure incident on Curiosity's Side-A computer caused the rover to go into safe mode, interrupting its planned science activities [82564].
2. NASA had to switch the rover over to its Side-B computer to address the memory glitch on Side-A, which affected the rover's operations [82564].
3. The software mismatch during image data transfer caused Curiosity to enter safe mode, ceasing most activities until the issue was resolved [46104].
4. The software failure incident halted Curiosity's science operations while NASA investigated the problem, delaying its exploration activities [81464].
Preventions
1. Regular software updates and patches to keep the software up to date and free of known issues [82564, 46104].
2. Thorough testing of software updates and changes before deployment to prevent unexpected glitches [82564].
3. Redundancy through backup systems or computers, such as the Side-B computer, in case the primary system fails [82564].
4. Proactive monitoring and analysis of system performance to detect early signs of potential failures [82564].
5. Improved memory management and error handling within the software to prevent memory-related glitches [82564].
Fixes
1. NASA switched the Curiosity rover over to its Side-B computer to address the memory glitch in the Side-A computer [82564].
2. The rover team reformatted the Side-B computer to isolate bad memory areas in an attempt to fix the memory issues [82564].
3. NASA is investigating the software issue by taking a snapshot of Curiosity's memory to better understand what might have happened [81464].
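The redundancy and memory-isolation measures listed under Preventions and Fixes can be illustrated with a small model. This is a sketch under stated assumptions, not a description of NASA's flight software: the `Computer` and `Rover` classes, the block counts, and the bad-block numbers are all hypothetical. Only the Side-A/Side-B names, the swap, and the reformatting of Side-B come from the articles.

```python
from dataclasses import dataclass, field


@dataclass
class Computer:
    """Hypothetical model of one of two redundant on-board computers."""
    name: str
    total_blocks: int
    bad_blocks: set[int] = field(default_factory=set)

    def reformat(self, failed_blocks: set[int]) -> None:
        # Record failing memory blocks so later use can avoid them.
        self.bad_blocks |= failed_blocks

    def usable_blocks(self) -> list[int]:
        # Every block that has not been marked as bad.
        return [b for b in range(self.total_blocks) if b not in self.bad_blocks]


@dataclass
class Rover:
    active: Computer
    standby: Computer
    safe_mode: bool = False

    def on_unexpected_reset(self) -> None:
        # An unexpected reset suspends planned activities.
        self.safe_mode = True

    def swap_computers(self) -> None:
        # Promote the standby computer and leave safe mode.
        self.active, self.standby = self.standby, self.active
        self.safe_mode = False


side_a = Computer("Side-A", total_blocks=8)  # block count is illustrative
side_b = Computer("Side-B", total_blocks=8)
rover = Rover(active=side_a, standby=side_b)

rover.on_unexpected_reset()   # Side-A resets and the rover enters safe mode
side_b.reformat({2, 5})       # bad-block numbers are invented for the example
rover.swap_computers()        # operations continue on Side-B

print(rover.active.name, rover.safe_mode, rover.active.usable_blocks())
# Side-B False [0, 1, 3, 4, 6, 7]
```

Keeping the bad-block set on each computer means the isolation survives a later swap back, which matches the articles' account of the rover moving between the two sides more than once.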
References
1. NASA update on Curiosity's Side-A computer reset incident [Article 82564]
2. NASA confirmation of Curiosity entering safe mode due to a software mismatch during image data transfer [Article 46104]
3. NASA announcement of Curiosity experiencing a "hiccup during boot-up" triggering safe mode [Article 81464]

Software Taxonomy of Faults

Category: Recurring
Option: one_organization
Rationale:
(a) one_organization: Curiosity's glitches and entry into safe mode after a "hiccup during boot-up" have happened before within the same organization, NASA. Article 82564 reports that Curiosity's Side-A computer reset, triggering the rover's safe mode, and that this was the second unexpected reset in three weeks. NASA had also faced a memory glitch with Curiosity in late 2018, which led to switching the rover's "brains" from Side-B to Side-A and then back to Side-B [82564].
(b) multiple_organization: The provided articles contain no information about the incident recurring at other organizations or in their products and services.

Category: Phase (Design/Operation)
Option: design, operation
Rationale:
(a) design: Article 46104 reports that the Curiosity Mars rover experienced a "software mismatch" while transferring images, causing it to enter safe mode [46104]. This points to contributing factors introduced by system development or updates.
(b) operation: Article 82564 reports that the rover's Side-A computer reset, triggering the rover's safe mode. The reset occurred while the rover was operating on Mars, which points to contributing factors introduced by operation of the system [82564].

Category: Boundary (Internal/External)
Option: within_system, outside_system
Rationale:
(a) within_system: The articles state that the resets and safe-mode triggers were related to the computer's memory [82564]. NASA also confirmed that the entry into safe mode was caused by a "software mismatch" in the rover's software as it transferred images [46104]. These issues were internal to the rover's system and memory management.
(b) outside_system: The incident can also be considered outside_system to some extent. The articles discuss how the Curiosity rover was affected by external factors such as dust storms on Mars, which led to the end of the Opportunity rover mission [82564]. However, the specific glitches and resets experienced by Curiosity were primarily attributed to issues within the rover's system and memory [82564, 46104].

Category: Nature (Human/Non-human)
Option: non-human_actions
Rationale:
(a) non-human actions:
- Article 82564 reports that the rover's Side-A computer reset, triggering the rover's safe mode. The reset was related to the computer's memory, indicating a technical glitch rather than human error [82564].
- Article 46104 reports that the rover put itself into safe mode because of a "software mismatch" in the rover's software as it transferred images. The issue concerned how image data was transferred on board, suggesting a non-human cause [46104].
(b) human actions:
- The articles do not attribute the incidents to human actions. Both appear to stem from technical glitches or software mismatches rather than human error.

Category: Dimension (Hardware/Software)
Option: hardware, software
Rationale:
(a) hardware:
- Article 82564 reports that NASA's Curiosity rover experienced glitches on Mars, the latest being a reset of the Side-A computer that triggered the rover's safe mode. NASA attributed the resets to the computer's memory, indicating a hardware-related issue [82564].
(b) software:
- Article 46104 reports that the Curiosity Mars rover experienced a software mismatch on July 2, causing it to put itself into a safe standby mode. NASA confirmed that the problem was a software mismatch in how image data were transferred on board [46104].
- Article 81464 reports that Curiosity experienced a "hiccup during boot-up," sending the rover into safe mode and interrupting its planned science activities. NASA is investigating by taking a snapshot of the rover's memory to understand what happened, indicating a software-related issue [81464].

Category: Objective (Malicious/Non-malicious)
Option: non-malicious
Rationale:
(a) The incidents reported in the articles are non-malicious. Article 82564 reports that the rover experienced glitches and resets related to the computer's memory, prompting NASA to switch computers. The resets were unintentional and stemmed from technical problems with the memory [82564]. Article 46104 reports that the rover entered safe mode because of a software mismatch in how image data were transferred on board, ceasing most activities until the issue was resolved [46104]. These incidents were caused by technical issues within the rover's software and hardware, not by malicious intent.

Category: Intent (Poor/Accidental Decisions)
Option: accidental_decisions
Rationale:
(a) The glitches Curiosity experienced on Mars were not due to poor decisions but to technical issues such as memory glitches and software mismatches. NASA had to switch between the rover's Side-A and Side-B computers to address the memory-related resets and software mismatches that triggered safe mode [82564, 46104, 81464].

Category: Capability (Incompetence/Accidental)
Option: development_incompetence
Rationale:
(a) development incompetence:
- Article 46104 reports that the Curiosity Mars rover experienced a software mismatch in how image data were transferred on board, leading the rover to enter safe mode. The issue was attributed to a "software mismatch" in the rover's software as it transferred images, indicating a failure related to software development [46104].
(b) accidental:
- Article 82564 reports that the rover experienced glitches on Mars, with the Side-A computer resetting unexpectedly and triggering the rover's safe mode. NASA described the resets as related to the computer's memory, indicating an accidental software failure incident [82564].

Category: Duration
Option: temporary
Rationale:
(a) The incident appears to be temporary. Article 82564 reports that Curiosity's Side-A computer reset on March 6, triggering the rover's safe mode. The incident was described as a "hiccup during boot-up" and was related to the computer's memory. NASA switched the rover over to its Side-B computer to address the issue, indicating that the failure was not permanent [82564].
(b) The incident was also temporary in that it arose from specific circumstances: a software mismatch during image data transfer. Article 46104 states that Curiosity put itself into safe mode on July 2, and NASA determined that the most likely cause was a "software mismatch" in how image data was transferred on board. The incident was resolved by avoiding that specific transfer mode, indicating that the failure was temporary and limited to certain circumstances [46104].

Category: Behaviour
Option: crash
Rationale:
(a) crash:
- Article 82564 reports that the rover's Side-A computer reset, triggering the rover's safe mode, which is considered a crash [82564].
- Article 81464 reports that Curiosity experienced a "hiccup during boot-up," sending the rover into safe mode and interrupting its planned science activities, which indicates a crash [81464].
(b) omission:
- The provided articles do not mention a failure related to omission.
(c) timing:
- The provided articles do not mention a failure related to timing.
(d) value:
- The provided articles do not mention a failure related to value.
(e) byzantine:
- The provided articles do not mention a failure related to byzantine behavior.
(f) other:
- The incidents reported in the articles can be categorized as crashes because the system lost state and did not perform its intended functions [82564, 81464].

IoT System Layer

Layer: Perception
Option: processing_unit, embedded_software
Rationale:
(a) sensor: The articles do not mention any sensor-related failures.
(b) actuator: The articles do not mention any actuator-related failures.
(c) processing_unit: The incidents were related to the processing unit of the Curiosity rover. Article 82564 states that the rover's computer experienced resets related to memory issues, prompting NASA to switch between the Side-A and Side-B computers [82564]. Article 81464 reports that Curiosity experienced a "hiccup during boot-up," sending the rover into safe mode and interrupting its planned science activities [81464].
(d) network_communication: The articles do not mention any network communication-related failures.
(e) embedded_software: The incidents were related to the embedded software of the Curiosity rover. Article 46104 states that the rover entered safe mode because of a "software mismatch" in the rover's software as it transferred images [46104]. Article 81464 reports that Curiosity experienced a "hiccup during boot-up," sending the rover into safe mode and interrupting its planned science activities [81464].

Layer: Communication
Option: unknown
Rationale: The provided articles do not mention any issues in the communication layer of the cyber-physical system that failed. It is therefore unknown whether the failures were related to the link_level or connectivity_level.

Layer: Application
Option: TRUE
Rationale: The failure in the application layer of the cyber-physical system is described in Article 46104. The article reports that the Curiosity Mars rover experienced a software mismatch in one mode of transferring image data on board, which caused the rover to enter safe mode. NASA confirmed that the most likely cause of entry into safe mode was this software mismatch, which falls under application layer failures [Article 46104].

Other Details

Category: Consequence
Option: delay, non-human, other
Rationale:
(a) death (people lost their lives due to the software failure): The provided articles do not mention any deaths caused by the incidents [82564, 46104, 81464].
(b) harm (people were physically harmed due to the software failure): The provided articles do not mention any physical harm to people [82564, 46104, 81464].
(c) basic (people's access to food or shelter was impacted because of the software failure): The provided articles do not mention any impact on people's access to food or shelter [82564, 46104, 81464].
(d) property (people's material goods, money, or data was impacted due to the software failure): The incidents affected the operations of the Curiosity rover on Mars, but the articles do not mention any direct impact on people's material goods, money, or data [82564, 46104, 81464].
(e) delay (people had to postpone an activity due to the software failure): The failures delayed the operations of the Curiosity rover on Mars. For example, in one incident the rover had to switch computers entirely because of glitches, delaying science operations [82564, 46104, 81464].
(f) non-human (non-human entities were impacted due to the software failure): The failures directly affected the Curiosity rover, a non-human entity, sending it into safe mode, interrupting its planned science activities, and requiring computer switches and memory reformatting [82564, 46104, 81464].
(g) no_consequence (there were no real observed consequences of the software failure): The failures did have observable consequences for the rover's operations, such as entering safe mode, interrupted science activities, and technical investigations including memory snapshots [82564, 46104, 81464].
(h) theoretical_consequence (there were potential consequences discussed of the software failure that did not occur): The articles do not discuss potential consequences that did not actually occur [82564, 46104, 81464].
(i) other (consequences not described in options (a) to (h)): The failures required technical investigations, memory snapshots, computer switches, and memory reformatting to address the glitches and keep the rover functional, affecting the rover's operations and mission progress [82564, 46104, 81464].

Category: Domain
Option: knowledge
Rationale:
(a) The failed system belongs to the knowledge industry: it supported NASA's Curiosity rover mission on Mars, which involves space exploration and scientific research [82564, 46104, 81464].
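The options selected for the Consequence category (delay, non-human, other) can be checked against the labels (a) through (i) listed in its rationale. The snippet below is a minimal sketch of such a check; the function name `parse_options` and the comma-separated input format are assumptions for illustration, not part of the database that produced this record.

```python
# Consequence labels (a) through (i), as listed in the rationale above.
CONSEQUENCE_OPTIONS = (
    "death", "harm", "basic", "property", "delay",
    "non-human", "no_consequence", "theoretical_consequence", "other",
)


def parse_options(cell: str, allowed: tuple[str, ...]) -> list[str]:
    """Split a comma-separated option cell and reject unknown labels."""
    selected = [part.strip() for part in cell.split(",") if part.strip()]
    unknown = [opt for opt in selected if opt not in allowed]
    if unknown:
        raise ValueError(f"unknown option(s): {', '.join(unknown)}")
    return selected


print(parse_options("delay, non-human, other", CONSEQUENCE_OPTIONS))
# ['delay', 'non-human', 'other']
```

The same check could be applied to other categories by passing a different tuple of allowed labels.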

Sources
