Incident: TSB Online Banking Meltdown Caused by IT Upgrade Failure

Published Date: 2018-04-26

Postmortem Analysis
Timeline
1. The software failure incident at TSB occurred in April 2018 [70109].
System
1. TSB's IT "upgrade" [70109]
2. TSB's user authentication processes [70109]
Responsible Organization
1. Banco Sabadell: TSB's parent company, to which the primary blame for the IT "upgrade" fiasco was attributed [70109].
2. TSB: the CEO said the IT meltdown was caused by issues within the bank's architecture, including capacity constraints, which the IBM team was tasked with resolving [70109].
Impacted Organization
1. TSB customers [70109]
2. Small businesses unable to make payments to HMRC [70109]
Software Causes
1. Capacity constraints in the IT architecture, potentially related to issues with data centers, fiber-optic cables, and software layers [70109].
2. User authentication processes that did not scale properly, leading to errors and difficulties for internet banking customers [70109].
3. Raw Java error messages, mentioning BeanFactories and NullPointerExceptions, were shown to customers, causing alarm and confusion [70109].
Non-software Causes
1. Lack of communication and transparency from TSB's Spanish parent company, Banco Sabadell, regarding the IT upgrade [70109].
2. Capacity constraints within TSB's IT architecture, including issues with data centers, fiber-optic cables, and software layers [70109].
Impacts
1. Customers were unable to access their accounts online, leading to inconvenience and financial difficulties [70109].
2. TSB staff had to work hard to resolve technical problems and regain customer trust [70109].
3. TSB customers experienced embarrassment and panic due to declined cards and difficulties in accessing funds [70109].
4. Small businesses faced challenges in making payments to HMRC due to the IT meltdown [70109].
5. TSB started paying compensation to affected customers for the inconvenience caused [70109].
Preventions
1. Thorough testing and validation of the IT upgrade before implementation might have prevented the software failure incident [70109].
2. Ensuring that user authentication processes scaled to the load generated by internet banking customers could have helped prevent the issues faced by TSB [70109].
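The second prevention above concerns authentication capacity. As a rough illustration only, the check below uses Little's law (average in-flight requests = arrival rate × mean time in the system) to compare expected login load against a concurrency limit. The function names, the safety factor, and every number are hypothetical; none of them comes from the article or from TSB's systems.

```python
def required_concurrency(logins_per_second, mean_auth_seconds):
    """Little's law: average in-flight logins = arrival rate x mean time per login."""
    return logins_per_second * mean_auth_seconds


def has_headroom(logins_per_second, mean_auth_seconds,
                 max_concurrent_logins, safety_factor=2.0):
    """Return True if the limit covers the expected load times a safety margin.

    All parameters are illustrative assumptions, not measured values.
    """
    needed = required_concurrency(logins_per_second, mean_auth_seconds) * safety_factor
    return needed <= max_concurrent_logins


if __name__ == "__main__":
    # Hypothetical figures: 500 logins/s, 0.5 s per login, limit of 400 concurrent logins.
    print(required_concurrency(500, 0.5))
    print(has_headroom(500, 0.5, max_concurrent_logins=400))
```

A check like this is only a first-order estimate; it does not replace load testing the authentication path before a migration.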
Fixes
1. Identifying and resolving the issues causing capacity constraints within the architecture, such as problems with data centers, fiber-optic cables, and software layers [70109].
2. Ensuring proper scaling of user authentication processes so that internet banking customers can access the service without errors [70109].
3. Preventing raw error messages, such as those mentioning BeanFactories and NullPointerExceptions, from being shown to users [70109].
4. Timely resolution of the IT problems by the IBM team of experts brought in by TSB, with a deadline set for Saturday [70109].
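The third fix above concerns internal error details reaching customers. The sketch below shows one common way to avoid that: catch unexpected exceptions at the request boundary, log the full detail on the server under a reference id, and return only a generic message. The affected system reportedly surfaced Java errors; Python is used here purely for illustration, and the handler names, logger name, and message text are all hypothetical.

```python
import logging
import uuid

logger = logging.getLogger("banking.web")  # hypothetical logger name

GENERIC_MESSAGE = "Sorry, something went wrong. Please try again later."


def safe_handle(handler, request):
    """Run a request handler without leaking internal error details to the user."""
    try:
        return {"status": 200, "body": handler(request)}
    except Exception:
        # The full stack trace goes to the server log, tagged with a reference id
        # that support staff can use to look it up.
        reference = uuid.uuid4().hex[:8]
        logger.exception("Unhandled error, reference=%s", reference)
        return {"status": 500, "body": f"{GENERIC_MESSAGE} (reference: {reference})"}


def broken_balance_handler(request):
    # Hypothetical handler that fails much like a null dereference would in Java.
    account = None
    return account["balance"]  # raises TypeError


if __name__ == "__main__":
    print(safe_handle(broken_balance_handler, {})["body"])
```

The customer sees only the generic text and a reference id, while the exception type and stack trace stay in the server log.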
References
Article 70109 gathers information about the software failure incident from the following entities:
1. Paul Pester, TSB's chief executive [70109]
2. TSB customers, such as Richard Brittain and Claire McAdam [70109]
3. IT expert Simon Needham [70109]
4. Broadcast journalist Natalia Crawford [70109]
5. HMRC (Her Majesty's Revenue and Customs) [70109]
6. IBM team of experts [70109]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization
(a) The software failure incident having happened again at one_organization: The incident at TSB, involving a botched IT "upgrade" and an online banking meltdown, is classified as a recurring issue within the organization itself. The CEO mentioned that the bank was facing issues with its data centers, fiber-optic cables, and software layers, which were causing capacity constraints [70109]. This may indicate that TSB had faced similar technical challenges before.
(b) The software failure incident having happened again at multiple_organization: The article gives no specific indication that a similar software failure incident has occurred at other organizations or with their products and services.
Phase (Design/Operation) design, operation
(a) The software failure incident at TSB was related to the development phase, particularly its design aspect. The incident was attributed to issues within the IT architecture, including capacity constraints caused by problems in the system's design and software layers [70109]. Additionally, Java error messages and user authentication processes that did not scale properly indicated underlying design flaws in the system [70109].
(b) The software failure incident at TSB also had implications for the operation phase. Customers had difficulty accessing their accounts online, had cards declined, and were unable to make transactions, highlighting operational issues caused by the failure [70109]. Furthermore, the inability of small businesses to make payments to HMRC during the IT meltdown showed the operational impact on users trying to run their businesses [70109].
Boundary (Internal/External) within_system
(a) within_system: The software failure incident at TSB was primarily attributed to issues within the system itself. TSB's CEO mentioned that the IT meltdown was caused by issues within the bank's architecture, such as capacity constraints and problems with user authentication processes [70109]. Additionally, Java error messages, including NullPointerExceptions, were displayed to customers, indicating internal software issues [70109]. The CEO also mentioned that IBM experts were working to identify and resolve the issues within TSB's system [70109].
Nature (Human/Non-human) non-human_actions, human_actions
(a) The software failure incident occurring due to non-human actions:
- The software failure at TSB was attributed to issues within the IT architecture, specifically capacity constraints caused by issues within the data centers, fiber-optic cables, and software layers [70109].
- Error messages related to BeanFactories and NullPointerExceptions were displayed to users, indicating potential issues with user authentication processes not scaling properly [70109].
(b) The software failure incident occurring due to human actions:
- TSB's CEO, Paul Pester, mentioned that he was unsure about the exact cause of the IT meltdown and had IBM experts working on identifying the issues within the IT architecture [70109].
- TSB's handling of the situation was criticized, with concerns that the CEO's statements and tweets were nonsensical or factually incorrect, indicating potential human errors in communication and decision-making [70109].
- Customers and small businesses faced challenges accessing their accounts and making payments due to the IT meltdown, highlighting the impact of human actions on the failure incident [70109].
Dimension (Hardware/Software) hardware, software
(a) The software failure incident occurring due to hardware:
- Paul Pester mentioned that the IT meltdown was caused by issues within the bank's architecture involving data centers, fiber-optic cables, and layers of software, and that he had IBM experts working to identify the specific hardware-related issues [70109].
- TSB's user authentication processes were suspected of not scaling properly, leading to errors for customers trying to access internet banking services, which may point to a hardware-related scalability problem [70109].
(b) The software failure incident occurring due to software:
- TSB customers saw alarming Java error messages, including NullPointerExceptions, indicating software-related issues in the system [70109].
- The article mentions that TSB's online banking customers were facing issues due to software errors being displayed to users, suggesting software-related problems in the system [70109].
Objective (Malicious/Non-malicious) non-malicious
(a) The software failure incident at TSB appears to be non-malicious. The incident was attributed to a botched IT "upgrade" that caused an IT meltdown, preventing customers from accessing their accounts online [70109]. TSB's CEO mentioned that the specific issue causing the meltdown was related to capacity constraints within the IT architecture, which IBM experts were working to resolve [70109]. Additionally, error messages related to user authentication processes not scaling properly were observed, indicating technical issues rather than malicious intent [70109].
(b) The article does not mention any intentional actions or malicious intent behind the IT meltdown. Instead, the focus is on technical issues, errors in the IT upgrade, and capacity constraints within the IT architecture as the primary causes of the failure [70109]. Customers and staff were affected by the incident, and efforts were made to compensate customers for the inconvenience caused [70109]. HMRC also indicated understanding for small businesses unable to make payments due to the TSB IT meltdown, which is consistent with a non-malicious incident [70109].
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions
(a) The software failure incident at TSB seems to have been caused by poor decisions. The article mentions that the primary blame for the IT "upgrade" fiasco lies with Banco Sabadell, TSB's parent company in Spain [70109]. Additionally, the TSB CEO, Paul Pester, was criticized for his handling of the fallout from the botched IT upgrade, with some of his statements described as nonsensical and factually wrong [70109].
(b) The software failure incident at TSB also seems to have been influenced by accidental decisions or mistakes. Paul Pester mentioned that he did not yet know exactly what caused the IT meltdown and had IBM experts working to identify the issues within the complex IT architecture [70109]. Additionally, there were reports of alarming error messages confusing customers, indicating potential mistakes in the software implementation [70109].
Capability (Incompetence/Accidental) development_incompetence
(a) The software failure incident occurring due to development incompetence:
- The TSB IT meltdown was attributed to issues in the architecture causing capacity constraints, with TSB's CEO mentioning the involvement of IBM experts to resolve the problem [70109].
- IT expert Simon Needham pointed out that TSB's user authentication processes were not scaling properly, leading to errors visible to users [70109].
(b) The software failure incident occurring accidentally:
- TSB CEO Paul Pester mentioned that he did not know exactly what caused the IT meltdown, indicating a lack of clarity on the specific issue [70109].
Duration temporary
(a) The software failure incident in this case seems to be temporary. TSB's CEO, Paul Pester, mentioned that IBM experts were working to identify and resolve the issues causing capacity constraints within the bank's IT architecture [70109]. Additionally, TSB expected the IBM team to fix the problems by Saturday, indicating that the disruption was not permanent and that recovery was expected by the start of the next week [70109].
Behaviour crash, omission, value, other
(a) crash: The software failure incident can be associated with crash behavior. Customers were unable to access their accounts online, indicating a system crash in which the system lost its state and failed to perform its intended functions [70109].
(b) omission: The incident also involved omission behavior, where the system omitted to perform its intended functions in some instances. Customers could not make bank transfers or access their accounts [70109].
(c) timing: The article makes no specific mention of timing-related failures.
(d) value: The incident can be linked to value behavior, where the system performed its intended functions incorrectly. For example, customers reported money missing from their accounts and were unable to make payments [70109].
(e) byzantine: The article provides no information suggesting byzantine behavior in the software failure incident.
(f) other: The incident also exhibited other behaviors, such as displaying alarming error messages to customers, causing confusion and frustration. Additionally, the incident left customers feeling embarrassed, panicked, and worried, affecting their trust in the bank [70109].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, theoretical_consequence
(a) unknown
(b) unknown
(c) unknown
(d) Customers were unable to access their accounts online, make bank transfers, or pay for transactions, leading to inconvenience and financial issues [70109].
(e) Customers had to delay their financial transactions and activities due to the software failure [70109].
(f) unknown
(g) Customers faced real consequences such as being unable to access their accounts, experiencing delays in transactions, and feeling embarrassed or panicked due to the software failure [70109].
(h) The potential consequence discussed was that small businesses might not be able to make payments to tax authorities due to the TSB IT meltdown, but HMRC stated it would take the circumstances into account and not penalize late payments if there was a reasonable excuse [70109].
(i) unknown
Domain finance
(a) The failed system was related to the finance industry, specifically affecting TSB's online banking services [70109].
(h) The software failure incident impacted TSB's online banking services, which are crucial for manipulating and moving money for profit in the finance industry [70109].
