Incident Title: Nasdaq OMX Group Trading Halt Due to Software Bug

Published Date: 2013-08-29

Postmortem Analysis
Timeline:
1. The software failure incident at Nasdaq OMX Group occurred on August 22, 2013 [20841].
System:
1. Securities Information Processor (SIP): failed because of a software bug and internal technology issues, specifically a flaw in the software code that prevented the processor's built-in backup system from resetting properly [20841].
Responsible Organization:
1. Nasdaq OMX Group [20841]
2. NYSE Euronext's Arca exchange [20841]
Impacted Organization:
1. Nasdaq OMX Group [20841]
2. NYSE Euronext's Arca exchange [20841]
Software Causes:
1. A software bug that kept the system from failing over properly [20841].
2. A latent flaw in the SIP's software code that prevented the processor's built-in backup system from resetting properly [20841].
Non-software Causes:
1. Problems at NYSE Euronext's Arca exchange that triggered internal technology issues at Nasdaq [20841]
2. Connection problems between NYSE's Arca exchange and Nasdaq's system [20841]
3. Capacity erosion caused by Arca connecting to and disconnecting from the SIP multiple times [20841]
4. Inaccurate stock symbols sent to the SIP, which generated rejection messages [20841]
5. Traffic from Arca that exceeded the planned capacity of the SIP [20841]
Impacts:
1. The software failure incident led to a three-hour trading halt on August 22, disrupting the stock market [20841].
2. The glitch in the backup system and the software bug prevented the system from fully reverting to backup mode, which kept trading from resuming promptly [20841].
3. The software flaw in the SIP delayed the proper reset of the backup system, further prolonging the return of data and the resumption of trading [20841].
4. The incident revealed a latent flaw in the SIP's software code, highlighting potential vulnerabilities in the system [20841].
5. Nasdaq expressed deep disappointment in the incident and acknowledged that the performance was unacceptable to members, issuers, and the investing public, emphasizing the need for continuous improvement in technology operations [20841].
Preventions:
1. Implementing thorough testing procedures for the backup system to ensure it can fail over properly [20841].
2. Conducting regular capacity planning and monitoring to prevent traffic from exceeding planned capacity limits [20841].
3. Enhancing the software code of the SIP to address latent flaws that could lead to failures [20841].
4. Improving communication and coordination between exchanges to prevent connection problems that can trigger software failures [20841].
Fixes:
1. Implementing potential design changes to make the SIP more resilient, including architectural improvements, information security enhancements, disaster recovery plans, and capacity parameters [20841].
References:
1. Nasdaq OMX Group [20841]
2. NYSE Euronext's Arca exchange [20841]
3. Bob Greifeld, Nasdaq’s chief executive [20841]
4. New York Stock Exchange parent NYSE [20841]
5. U.S. exchanges and the Financial Industry Regulatory Authority [20841]
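
The first prevention listed above, testing that the backup system can fail over properly, can be pictured with a small drill. The following is a minimal Python sketch under stated assumptions: the `Processor` and `FailoverPair` classes, the `failover_drill` function, and the message names are hypothetical illustrations, not a description of the SIP's actual design, which the article does not detail.

```python
class Processor:
    """Hypothetical message processor that can be marked unhealthy."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.processed = []

    def handle(self, message):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        self.processed.append(message)


class FailoverPair:
    """Hypothetical primary/backup pair that switches to the backup on failure."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        self.active = primary

    def handle(self, message):
        try:
            self.active.handle(message)
        except RuntimeError:
            if self.active is self.backup:
                raise  # Both sides are down; nothing left to fail over to.
            self.active = self.backup
            self.active.handle(message)


def failover_drill():
    """Take the primary down on purpose and check that the backup takes over."""
    primary, backup = Processor("primary"), Processor("backup")
    pair = FailoverPair(primary, backup)
    pair.handle("quote-1")      # Served by the primary.
    primary.healthy = False     # Simulated primary failure.
    pair.handle("quote-2")      # Should be served by the backup.
    return pair.active.name, primary.processed, backup.processed


if __name__ == "__main__":
    print(failover_drill())
```

A drill of this kind is only useful if it runs regularly against the real failover path; the sketch shows the shape of the check, not its production form.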

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident related to Nasdaq OMX Group's trading halt on August 22, 2013, was due to a software bug and internal technology issues that caused the backup system to fail. Nasdaq's chief executive mentioned that there was a bug in the system that didn't fail over properly, leading to the outage. Nasdaq expressed deep disappointment over the incident and emphasized the need to work hard to prevent such failures from happening again within the organization [20841]. (b) The software failure incident at Nasdaq OMX Group on August 22, 2013, involved connection problems between Nasdaq's system and NYSE Euronext's Arca exchange. This incident affected multiple organizations as it impacted the trading activities of various U.S. exchanges and the Financial Industry Regulatory Authority (FINRA). Nasdaq mentioned plans to present recommendations for changes to the SIP governing committee, which is composed of U.S. exchanges and FINRA, within 30 days to address the software flaw and improve system resilience [20841].
Phase (Design/Operation) design, operation (a) The software failure incident at Nasdaq was primarily attributed to a design issue. Nasdaq stated that the massive trading halt was caused by a software bug and other internal technology issues triggered by problems at NYSE Euronext's Arca exchange, which led to the failure of a key backup system. Nasdaq's chief executive, Bob Greifeld, acknowledged that there was a bug in the system and that it did not fail over properly, indicating a design flaw in the backup system [20841]. (b) The incident also had operational elements. Nasdaq reported that on the morning of the incident, Arca connected to and disconnected from the Securities Information Processor (SIP) more than 20 times, which eroded capacity. In addition, Arca sent a stream of inaccurate stock symbols to the SIP, generating numerous rejection messages that further eroded the SIP's capacity. These operational issues contributed to the failure of the SIP and revealed a latent flaw in the software code [20841].
Boundary (Internal/External) within_system, outside_system (a) within_system: The software failure incident at Nasdaq OMX Group was primarily attributed to a software bug and other internal technology issues within the system itself. Nasdaq stated that there was a bug in the system and that it did not fail over properly, so the backup system did not work as intended [20841]. In addition, the software flaw within the Securities Information Processor (SIP) prevented the built-in backup system from resetting properly, which is a further internal factor [20841]. (b) outside_system: The incident was also influenced by problems at NYSE Euronext's Arca exchange, which connected to and disconnected from the SIP multiple times and sent inaccurate stock symbols to it, eroding capacity and ultimately contributing to the failure [20841]. This external factor played a role in the software failure incident at Nasdaq.
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident was primarily attributed to non-human actions, specifically a software bug and internal technology issues triggered by problems at NYSE Euronext's Arca exchange that led to a key backup system failure at Nasdaq [20841]. The failure was described as a confluence of events that exceeded the planned capacity of the Securities Information Processor (SIP) and revealed a latent flaw in the software code of the SIP, preventing the backup system from resetting properly [20841]. (b) Human actions appear only indirectly in the article, in Nasdaq's stated need to work hard to ensure such incidents do not happen again [20841]. Nasdaq's chief executive acknowledged the bug in the system and the failure of the backup system to revert properly, which indicates a recognition that human intervention is needed to prevent similar failures in the future.
Dimension (Hardware/Software) software (a) The software failure incident at Nasdaq OMX Group was primarily attributed to a software bug and other internal technology issues, rather than hardware problems. Nasdaq's chief executive mentioned that there was a bug in the system that didn't fail over properly, indicating a software-related issue [20841]. (b) The software failure incident was specifically linked to a software bug in the system that prevented the backup system from resetting properly, leading to the failure of the Securities Information Processor (SIP) and the subsequent trading halt. Nasdaq acknowledged the software flaw in the SIP's code as a key factor in the incident [20841].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident at Nasdaq OMX Group was non-malicious. The incident was attributed to a software bug and internal technology issues triggered by problems at NYSE Euronext's Arca exchange, leading to a key backup system failure. Nasdaq's chief executive mentioned that there was a bug in the system that didn't fail over properly, indicating an unintentional failure [20841].
Intent (Poor/Accidental Decisions) unknown (a) The software failure incident at Nasdaq OMX Group was primarily attributed to a software bug and internal technology issues triggered by problems at NYSE Euronext's Arca exchange, leading to a key backup system failure. Nasdaq's chief executive mentioned that there was a bug in the system that didn't fail over properly, indicating a technical flaw rather than poor decisions as the root cause of the failure [20841].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The software failure incident at Nasdaq OMX Group was primarily attributed to a software bug and other internal technology issues, including problems at NYSE Euronext's Arca exchange that led to a key backup system failure. Nasdaq's chief executive mentioned that there was a bug in the system that didn't fail over properly, indicating a failure due to development incompetence [20841]. (b) The incident also involved accidental factors, such as connection problems between NYSE's Arca exchange and Nasdaq's system, which led to the system being overwhelmed with inaccurate stock symbols and messages, exceeding its planned capacity. This accidental overload caused the failure and revealed a latent flaw in the software code of the Securities Information Processor (SIP) [20841].
Duration temporary (a) The software failure incident described in the article was temporary. It was a three-hour outage on August 22 caused by a software bug and internal technology issues triggered by problems at NYSE Euronext's Arca exchange, which led to the failure of a key backup system [20841]. The underlying problem was resolved within 30 minutes; nearly three more hours were then spent testing and evaluating scenarios to reopen the market in a fair and orderly manner. The article notes that the software flaw prevented the processor's built-in backup system from resetting properly, which delayed the return of data [20841].
Behaviour crash, omission, value, other (a) crash: The software failure incident at Nasdaq OMX Group was described as a crash where the backup system did not work properly, leading to a three-hour outage on August 22, 2013. Nasdaq's chief executive mentioned that there was a bug in the system, and it didn't fail over properly, resulting in the system not fully reverting to backup mode [20841]. (b) omission: The software failure incident also involved omission as the system failed to handle the connection problems between NYSE's Arca exchange and Nasdaq's system, leading to the failure of the backup system. This omission of proper handling of connections and capacity issues contributed to the overall failure [20841]. (c) timing: The timing aspect of the software failure incident was evident in the delay in resolving the problem and reopening the market in a fair and orderly manner. It took Nasdaq 30 minutes to resolve the initial problem and then nearly three more hours to test and evaluate scenarios before reopening the market [20841]. (d) value: The software failure incident also involved a failure in value as the system processed inaccurate stock symbols from Arca, generating numerous rejection messages. This incorrect processing of data contributed to the failure of the Securities Information Processor (SIP) and revealed a latent flaw in the software code [20841]. (e) byzantine: There is no specific mention of the software failure incident exhibiting a byzantine behavior in the provided article. (f) other: The software failure incident also involved a combination of factors that led to the failure, including capacity issues, connection problems, inaccurate data processing, and a latent flaw in the software code. The confluence of these events vastly exceeded the planned capacity of the SIP, causing its failure and revealing the software flaw [20841].
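
Several rationales above point to traffic that exceeded the SIP's planned capacity. One way to picture the kind of capacity monitoring this suggests is a sliding-window counter that flags when message volume goes over a planned limit. The sketch below is in Python and rests on assumptions: the `CapacityMonitor` class, the window length, and the limit of 3 messages are hypothetical values chosen for illustration, not figures from the article.

```python
from collections import deque


class CapacityMonitor:
    """Hypothetical sliding-window counter compared against a planned limit."""

    def __init__(self, planned_per_window: int, window_seconds: float):
        self.planned_per_window = planned_per_window
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, now: float) -> bool:
        """Record one message at time `now`; return True if still within plan."""
        self._timestamps.append(now)
        cutoff = now - self.window_seconds
        # Drop messages that have aged out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps) <= self.planned_per_window


if __name__ == "__main__":
    # Illustrative limit: at most 3 messages per 1-second window.
    monitor = CapacityMonitor(planned_per_window=3, window_seconds=1.0)
    for t in (0.0, 0.1, 0.2, 0.3, 1.25):
        print(t, monitor.record(t))
```

In this sketch every inbound message counts against the same budget, so rejection messages and reconnect traffic would erode headroom in the same way as valid quotes, which mirrors the capacity erosion described in the rationale.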

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, non-human The consequence of the software failure incident reported in the article was a delay in trading activities. Nasdaq OMX Group experienced a three-hour outage on August 22 due to a software bug and internal technology issues, leading to the halt in trading. The software failure caused the system to degrade, prompting Nasdaq to stop trading to ensure fair market conditions. It took 30 minutes to resolve the problem and nearly three more hours to test and evaluate scenarios to reopen the market in a fair and orderly manner [20841].
Domain finance (a) The failed system was related to the finance industry as it affected the Nasdaq stock exchange, which is a key player in the financial markets [20841].

Sources
