Incident: Memory Corruption Causes Delay in Curiosity Mars Rover Operations

Published Date: 2013-03-04

Postmortem Analysis
Timeline
1. The software failure incident with the Curiosity rover on Mars occurred in February 2013 [Article 17853].
2. Article 17853 was published on March 19, 2013.
System
1. Side A computer of the Curiosity Mars rover
2. Side B computer of the Curiosity Mars rover
3. Memory system of the Curiosity Mars rover [17853, 17558]
Responsible Organization
1. None identified. The articles attribute the incident to a suspected natural cause rather than to an organization: space radiation producing a "single-event upset" that affected memory addresses [Article 17558].
Impacted Organization
1. Curiosity rover on Mars [17853, 17558]
Software Causes
1. Memory problems with one of the craft's two identical computers led to the postponement of science observations by Curiosity [Article 17853].
2. Memory corruption was discovered in the rover's active flight computer, interrupting science operations and forcing a switch to the backup flight computer [Article 17558].
3. A software glitch in Curiosity's second computer put it into standby mode [Article 17853].
4. Engineers suspected that the memory glitch might have been caused by space radiation: a "single-event upset" that changed the state of memory addresses [Article 17558].
Non-software Causes
1. Space radiation causing a "single-event upset" in the memory, potentially leading to data corruption [Article 17558]
2. Energetic particles potentially making it through radiation-hardened components and changing the state of memory addresses [Article 17558]
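The single-event upset described above amounts to one bit changing state in memory. The following is a minimal sketch, not the rover's flight software: it simulates such a flip and shows how a stored checksum can reveal it. The function names `flip_bit` and `is_intact`, the sample block, and the choice of CRC-32 are all assumptions made for illustration; the articles do not say how the corruption was detected.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted (a simulated upset)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


def is_intact(data: bytes, expected_crc: int) -> bool:
    """Compare the CRC-32 of `data` with the checksum stored earlier."""
    return zlib.crc32(data) == expected_crc


# Hypothetical memory block and the checksum recorded when it was written.
block = b"science-data-block"
stored_crc = zlib.crc32(block)

# Simulate an energetic particle flipping bit 13 of the block.
upset = flip_bit(block, 13)

print(is_intact(block, stored_crc))  # True: the original block matches
print(is_intact(upset, stored_crc))  # False: the flipped bit is detected
```

A CRC-32 detects any single-bit change, which is why the second check fails. It only detects the error; correcting it would need an error-correcting code or a clean copy of the data.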
Impacts
1. Science operations on the Curiosity rover were delayed for three weeks [Article 17853].
2. Science operations were limited, with a monthlong hiatus in rover operations also planned because the Sun's alignment between Earth and Mars could corrupt communication signals [Article 17853].
3. Memory corruption in the rover's active computer required a complex sequence of steps to switch operations to a backup flight computer [Article 17558].
4. Science operations were interrupted and the craft was put in a low-activity "safe mode" while the computer switch was implemented [Article 17558].
5. Engineers took their time to ensure the switchover to the backup computer was carried out correctly, which delayed the full restoration of operations [Article 17558].
Preventions
1. Regularly conducting thorough testing and quality assurance procedures on the software to detect and address potential memory corruption issues before they impact operations [Article 17558].
2. Implementing additional radiation-hardened components or shielding to protect the memory systems from space radiation-induced single-event upsets [Article 17558].
3. Developing and implementing robust error-handling mechanisms within the software to detect and mitigate memory corruption issues in real time [Article 17558].
4. Enhancing redundancy and failover mechanisms within the software architecture to quickly switch operations to a backup system in case of memory corruption or other failures [Article 17558].
5. Establishing a comprehensive monitoring and alert system to promptly detect anomalies or faults in the memory systems and take proactive measures to prevent further issues [Article 17558].
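The redundancy and failover idea in the preventions above can be sketched as a simple selection rule between two computers. This is an illustrative model only: the `FlightComputer` class, its `memory_ok` flag, and `select_active` are hypothetical names, and the real switchover was described as a complex, ground-commanded sequence rather than a single function call.

```python
from dataclasses import dataclass


@dataclass
class FlightComputer:
    """Hypothetical model of one of two redundant computers."""
    name: str
    memory_ok: bool = True


def select_active(primary: FlightComputer, backup: FlightComputer) -> FlightComputer:
    """Prefer the primary; fall back to the backup if primary memory is bad."""
    if primary.memory_ok:
        return primary
    if backup.memory_ok:
        return backup
    raise RuntimeError("no healthy computer available")


# Assumed scenario: a memory check has flagged the A-side as corrupted.
a_side = FlightComputer("A-side", memory_ok=False)
b_side = FlightComputer("B-side")

active = select_active(a_side, b_side)
print(active.name)  # B-side
```

Raising an error when neither side is healthy keeps the sketch honest about the case it cannot handle; a real system would need a defined safe state for that situation.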
Fixes
1. Uploading configuration files and parameters to the B-side computer to fully recover the system [Article 17558].
2. Powering up the A-side computer without loading software to check the status of the nonvolatile memory and potentially rewrite memory blocks to flush corrupted data [Article 17558].
3. Attempting a full reboot of the A-side computer to rewrite memory blocks and clear it to serve as a backup to the B-side computer [Article 17558].
4. Bypassing corrupted memory locations with a software patch if the memory problem cannot be corrected [Article 17558].
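The last fix, bypassing corrupted memory locations, can be illustrated with a bad-block map: scan the blocks, record which ones fail a checksum, and exclude those from further use. This is a sketch under stated assumptions, not the patch the engineers considered; `find_bad_blocks`, `usable_blocks`, and the CRC-32 check are hypothetical choices for the example.

```python
import zlib


def find_bad_blocks(blocks: list[bytes], checksums: list[int]) -> set[int]:
    """Return the indexes of blocks whose CRC-32 no longer matches."""
    return {
        index
        for index, (block, crc) in enumerate(zip(blocks, checksums))
        if zlib.crc32(block) != crc
    }


def usable_blocks(total: int, bad: set[int]) -> list[int]:
    """List the block indexes that remain available once bad ones are skipped."""
    return [index for index in range(total) if index not in bad]


# Hypothetical memory contents and the checksums recorded at write time.
blocks = [b"aaaa", b"bbbb", b"cccc"]
checksums = [zlib.crc32(block) for block in blocks]

# Simulate corruption of the middle block (one bit differs from b"bbbb").
blocks[1] = b"bbbc"

bad = find_bad_blocks(blocks, checksums)
print(usable_blocks(len(blocks), bad))  # [0, 2]
```

Skipping a block trades capacity for reliability, which fits the source's framing of this step as a fallback if the memory problem cannot be corrected.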
References
1. Richard Cook, the project manager for Curiosity at NASA's Jet Propulsion Laboratory in Pasadena, Calif. [Article 17853, Article 17558]
2. CBS News [Article 17558]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident happened again at the same organization, NASA, with the Curiosity rover on Mars. The incident involved memory problems with one of the craft's two identical computers, leading to a delay in science operations. Engineers had to switch to the second computer while working on ways to avoid a recurrence of the problem. However, just before finishing the troubleshooting work, the second computer suffered a software glitch and put itself into standby mode [17853]. (b) The software failure incident involving memory corruption and the need to switch operations to a backup flight computer has occurred at other organizations as well. In the case of the Curiosity Mars rover, engineers discovered data corruption in the solid-state memory used by the rover's active flight computer, leading to the need for a complex procedure to switch operations to the backup computer. This incident highlights the challenges and risks associated with memory glitches and the importance of redundancy in critical systems [17558].
Phase (Design/Operation) design, operation (a) The software failure incident related to the design phase can be seen in the articles. The incident with the Curiosity rover on Mars was due to memory problems that cropped up with one of the craft's two identical computers during science observations [17853]. Engineers had to switch to the second computer while working on ways to avoid a recurrence of the problem. However, just a day away from finishing the troubleshooting work, the second computer suffered a software glitch and put itself into standby mode, causing further delays [17853]. This indicates that the initial design or development of the system may have had vulnerabilities that led to these software glitches. (b) The software failure incident related to the operation phase is evident in the articles as well. The memory glitch that interrupted science operations on the Curiosity rover was discovered during its active operation when it failed to send back science data as expected and did not put itself to sleep during scheduled downtime [17558]. This forced flight controllers to put the craft in a low-activity "safe mode" while the computer switch was implemented [17558]. The incident occurred during the operational phase of the rover's mission, highlighting issues that arose during the operation or use of the system.
Boundary (Internal/External) within_system (a) The software failure incident related to the Curiosity rover on Mars was primarily within the system. The incident involved memory problems with one of the craft's two identical computers, leading to a delay in science operations [Article 17853]. Engineers had to switch to the second computer while troubleshooting the issue. Additionally, a software glitch occurred with the second computer, putting it into standby mode, which was resolved internally by the engineers [Article 17853]. The memory corruption discovered in the rover's active computer was also an internal issue that interrupted science operations and required a complex sequence of steps to switch operations to a backup flight computer [Article 17558]. The engineers were focused on resolving the memory glitch within the system by conducting a thorough analysis and taking steps to ensure the proper functioning of the backup computer [Article 17558].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident related to non-human actions: - The software glitch that put the Curiosity rover into standby mode was a non-human action, occurring just a day before finishing the troubleshooting work [Article 17853]. - The memory corruption discovered in the rover's active computer was suspected to be caused by space radiation, specifically a "single-event upset" where an energetic particle changed the state of memory addresses [Article 17558]. (b) The software failure incident related to human actions: - Engineers were working on ways to avoid a recurrence of the memory problems in the Curiosity rover's computers [Article 17853]. - Engineers were conducting a complex sequence of steps to switch operations to a backup flight computer on the rover [Article 17558].
Dimension (Hardware/Software) hardware, software (a) The software failure incident related to hardware: - The memory glitch on the Curiosity Mars rover was suspected to have been caused by space radiation, specifically a "single-event upset" where an energetic particle changed the state of memory addresses [Article 17558]. - Engineers planned to power up the A-side computer without loading software to check the status of the nonvolatile memory, indicating a hardware-related investigation into the memory corruption issue [Article 17558]. (b) The software failure incident related to software: - The Curiosity rover experienced memory problems with one of its computers, leading to a delay in science operations. Engineers switched to the second computer while working on ways to avoid a recurrence of the problem [Article 17853]. - The second computer suffered a software glitch and put itself into standby mode, which engineers were able to resolve by commanding it back into regular operations [Article 17853]. - The memory corruption discovered in the rover's active computer interrupted science operations, leading to the implementation of a computer switch to the backup flight computer [Article 17558]. - Engineers were cautious about rebooting the A-side computer and loading software to avoid potentially destroying evidence that could help identify the root cause of the memory corruption issue, indicating a software-related approach to handling the incident [Article 17558].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident related to the Curiosity rover on Mars was non-malicious. The incident was caused by memory problems and a software glitch that put one of the craft's computers into standby mode, leading to a delay in science operations [17853, 17558]. Engineers worked on troubleshooting the issue and implementing a complex procedure to switch operations to a backup flight computer to resolve the memory corruption discovered in the active computer [17558]. The incident was attributed to a memory glitch possibly caused by space radiation, specifically a "single-event upset" affecting the memory addresses [17558]. (b) There is no indication in the articles that the software failure incident was malicious.
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident related to the Curiosity rover on Mars was not due to poor decisions but rather to accidental factors. The incident was caused by a memory glitch and a software glitch that interrupted science operations on the rover. Engineers had to switch operations to a backup flight computer to resolve the memory corruption issue discovered in the active computer [Article 17853, Article 17558]. The decision to switch to the backup computer was a deliberate and necessary step taken to address the technical problem, rather than a result of poor decisions.
Capability (Incompetence/Accidental) development_incompetence (a) The software failure incident related to development incompetence is evident in the articles. Engineers working on the Curiosity Mars rover encountered memory problems with one of the craft's computers, leading to a delay in science operations [Article 17853]. The incident involved a complex sequence of steps to switch operations to a backup flight computer, indicating the challenges faced due to the memory corruption discovered in the rover's active computer [Article 17558]. These issues highlight the impact of human error or lack of professional competence in the development or maintenance of the software systems onboard the rover.
Duration temporary From the provided articles [17853, 17558], the software failure incident related to the Curiosity rover on Mars can be categorized as a temporary failure. The incident involved memory problems and a software glitch that caused interruptions in science operations. Engineers were able to resolve the issues by switching to the backup computer system and implementing solutions to overcome the glitches. The duration of the software failure incident was temporary as the team was able to address the problems and resume limited science operations shortly after the incidents occurred.
Behaviour crash, omission, other (a) crash: The software failure incident related to the Curiosity rover on Mars involved a crash when the second computer suffered a software glitch and put itself into standby mode, causing a delay in science operations [17853]. (b) omission: The memory glitch in the active computer of the Curiosity rover led to the omission of sending back science data as expected and not putting itself to sleep during scheduled downtime, resulting in interrupted science operations [17558]. (c) timing: The software failure incident did not involve a timing issue as the system was not reported to perform its intended functions either too late or too early. (d) value: The software failure incident did not involve the system performing its intended functions incorrectly. (e) byzantine: The software failure incident did not exhibit a byzantine behavior with inconsistent responses and interactions. (f) other: The other behavior observed in the software failure incident was the need for a complex sequence of steps to switch operations to a backup flight computer, indicating a planned and systematic approach to address the memory corruption issue [17558].

IoT System Layer

Layer Option Rationale
Perception processing_unit, embedded_software (a) sensor: The software failure incident with the Curiosity rover on Mars was not directly related to a sensor error. The issue was primarily with memory problems in one of the craft's computers, leading to a switch to the backup computer [17853, 17558]. (b) actuator: The incident did not involve any actuator error. The focus was on resolving memory corruption in the rover's active flight computer and switching operations to the backup computer [17558]. (c) processing_unit: The software failure incident was related to the processing unit, specifically the active flight computer of the Curiosity rover. Engineers had to perform a complex procedure to switch operations to the backup computer due to memory corruption in the primary computer [17558]. (d) network_communication: The failure was not directly linked to network communication errors. The main issue was with memory corruption in the rover's computer systems, leading to the need to switch to the backup computer for operations [17558]. (e) embedded_software: The software failure incident was related to embedded software errors in the rover's computer systems. Engineers had to carefully manage the process of switching operations to the backup computer to address memory corruption in the primary computer [17558].
Communication unknown From the provided articles, there is no specific mention of the failure being related to the communication layer of the cyber-physical system that failed. The focus of the articles is on the memory problems and software glitches experienced by the Curiosity rover on Mars, leading to the need for a computer switch and troubleshooting related to memory corruption in the active flight computer [Article 17853, Article 17558].
Application FALSE The software failure incident related to the Curiosity Mars rover was not specifically related to the application layer of the cyber-physical system. The failure was primarily attributed to memory problems and a software glitch that affected the rover's computers [17853, 17558]. Therefore, the failure was not directly linked to the bugs, operating system errors, unhandled exceptions, or incorrect usage typically associated with the application layer.

Other Details

Category Option Rationale
Consequence property, delay, non-human, theoretical_consequence (a) death: There is no mention of any deaths resulting from the software failure incident in the provided articles [17853, 17558]. (b) harm: There is no mention of any physical harm to individuals resulting from the software failure incident in the provided articles [17853, 17558]. (c) basic: There is no mention of people's access to food or shelter being impacted due to the software failure incident in the provided articles [17853, 17558]. (d) property: The software failure incident did impact the Curiosity rover's operations and science observations, causing delays and interruptions in its mission [17853, 17558]. (e) delay: The software failure incident led to delays in the Curiosity rover's science operations, with a three-week delay initially and further delays as engineers worked to resolve the memory problems and software glitches [17853, 17558]. (f) non-human: The primary impact of the software failure incident was on the Curiosity rover itself, affecting its ability to conduct science operations on Mars [17853, 17558]. (g) no_consequence: The software failure incident did have consequences on the Curiosity rover's operations, leading to delays and interruptions in its mission [17853, 17558]. (h) theoretical_consequence: There were discussions about potential consequences of the software failure incident, such as the need to avoid a recurrence of the problem and concerns about communication signals being corrupted during a planned hiatus in rover operations due to the Sun's alignment between Earth and Mars [17853]. (i) other: There are no other consequences described in the articles beyond the impact on the Curiosity rover's operations and mission due to the software failure incident [17853, 17558].
Domain knowledge (a) The failed system supported the knowledge domain, specifically space exploration. The software failure incident occurred with the Curiosity rover on Mars, which is part of a $2.5 billion NASA mission to explore Mars over two years [Article 17853]. (i) The Curiosity rover experienced memory problems with one of its computers, leading to a delay in science operations. Engineers had to switch to the backup computer to troubleshoot the issue and resume regular operations [Article 17853, Article 17558].
