Incident: Legacy IT Systems Strain Banking Operations, Leading to Payment Disruptions

Published Date: 2014-01-27

Postmortem Analysis
Timeline 1. The software failure incident at Lloyds Banking Group occurred in January 2014, around the time the article was published on 2014-01-27 [23587].
System 1. Legacy systems at banks (30-40 years old) [23587] 2. ATM systems [23587] 3. Online banking systems [23587] 4. Mobile banking systems [23587] 5. Back-end systems [23587]
Responsible Organization 1. Banks running legacy systems that were originally set up for branch banking and have been continuously modified over the years without a complete overhaul, leaving the systems complex and fragile [23587]. 2. Banks that chronically underinvested in IT systems, resulting in quality checks being cut and insufficient resources allocated for maintaining and updating the systems [23587].
Impacted Organization 1. Customers of Lloyds Banking Group were impacted by the software failure incident, experiencing payment problems and being unable to make debit card transactions or withdraw cash from ATMs [23587]. 2. Customers of RBS's banking brands were also mentioned as having faced a previous "glitch" that caused some customers to go weeks without proper access to their accounts [23587].
Software Causes 1. Aging IT systems in need of overhaul due to legacy systems being 30-40 years old and not designed for modern banking channels [23587]. 2. Complex systems due to continuous bolt-on changes over the years, leading to increased complexity and interdependence of different elements [23587]. 3. Lack of understanding of the entire system structure due to new functions being written in different programming languages, on different machines, by different teams [23587]. 4. Chronic underinvestment in IT systems, leading to quality checks being cut and problems being compounded [23587]. 5. Increased frequency of breakdowns as new systems are layered on top of old systems without sufficient investment in robustness [23587].
Non-software Causes 1. Chronic underinvestment in IT systems, leading to lack of quality checks and maintenance [23587]. 2. Changes in banking channels and regulatory requirements necessitating constant tinkering with systems, making them more complex [23587]. 3. Legacy systems at banks being originally set up for branch banking and then needing to adapt to newer technologies like ATMs, online banking, and mobile banking [23587].
Impacts 1. Customers of Lloyds Banking Group experienced payment problems, with debit card transactions being declined and ATMs not dispensing cash for around three-and-a-half hours, highlighting the immediate impact on financial transactions [23587]. 2. The incident raised concerns about the reliance on aging IT systems in the banking sector, indicating a need for a serious overhaul to prevent future disruptions [23587]. 3. The failure led to customer inconvenience and frustration, as they were unable to access their accounts properly, emphasizing the importance of reliable IT systems in the modern banking environment [23587].
Preventions 1. Regular and adequate investment in IT systems: Chronic underinvestment in IT systems, as highlighted in the article, has been a significant factor contributing to software failure incidents [23587]. 2. Comprehensive system overhaul: Building IT systems from scratch or conducting a thorough overhaul to abstract out different elements and reduce reliance on each other could prevent cascading failures within the system [23587]. 3. Collaboration and shared back-office systems: Banks could consider collaborative approaches and sharing back-office systems to save money, time, and improve system reliability [23587].
Fixes 1. Investing heavily in building IT systems that customers can rely on, as acknowledged by RBS's chief executive after years of under-spending on computer systems [23587]. 2. Abstracting out different elements of the system to reduce reliance on each other, as suggested by experts in system development [23587]. 3. Collaborative approaches among banks to share back-office systems, potentially saving money and time while ensuring robust systems [23587].
References 1. David Bannister, editor of Banking Technology magazine [23587] 2. Ben Wilson, associate director of financial services for techUK [23587] 3. Jim McCall, managing director of the Unit [23587] 4. Colin Privett, UK managing director of software firm Cast [23587] 5. Ross McEwan, chief executive of RBS [23587] 6. Mark Holland, partner at the consultancy firm Holley Holland [23587]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization, multiple_organization (a) The software failure incident having happened again at one_organization: The article mentions that Lloyds Banking Group experienced payment problems due to a server failure, leading to debit card transactions being declined and ATMs not dispensing cash [23587]. This incident is similar to a previous "glitch" at RBS's banking brands that caused customers to go weeks without proper access to their accounts. Both incidents highlight the challenges faced by banks with aging IT systems in need of overhaul. (b) The software failure incident having happened again at multiple_organization: The article discusses how various high street banks, including Lloyds Banking Group and RBS, have faced IT issues due to aging systems and underinvestment in their IT infrastructure [23587]. These incidents indicate a broader industry issue where multiple organizations are struggling with the complexity of their legacy systems and the challenges of integrating new technologies and regulatory changes.
Phase (Design/Operation) design, operation (a) The article attributes software failures in banking systems to complexity introduced during system development and updates. Legacy systems at banks, which have been continuously modified and extended over the years, are described as resembling a "house of cards" where changing one part can have unforeseen consequences elsewhere [23587]. The article also mentions that new functions are typically written in different programming languages, on different machines, by different teams, making it challenging for any single person or team to fully understand the entire structure of the system and delaying the identification and fixing of issues [23587]. (b) On the operational side, the article discusses how running banking systems continuously, without downtime for maintenance, makes it difficult to implement necessary changes and updates. It compares the situation to "trying to change the windscreen while you're driving down the M6" [23587]. Because payments go through 24/7, scheduling maintenance windows is nearly impossible, which further complicates keeping IT systems reliable [23587].
Boundary (Internal/External) within_system, outside_system (a) The article indicates that the software failures in the banking systems originated primarily within the system. They were attributed to aging IT systems in need of overhaul, legacy systems that were not designed to handle modern banking channels like online and mobile banking, and the complexity arising from continuous bolt-on changes [23587]. (b) The article also mentions that external factors such as regulatory changes, increased capital requirements, and chronic underinvestment in IT systems contributed to the software failures [23587].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident occurring due to non-human actions: The article attributes the failures in the banking systems primarily to aging IT systems, legacy systems, continuous bolt-on changes, and the complexity arising from the integration of various technologies and channels over time. These factors were introduced without direct human participation but contributed to the failures. The systems were described as resembling a "house of cards," where even small changes could have cascading effects on the overall system [23587]. (b) The software failure incident occurring due to human actions: Human actions also played a role. The article mentions chronic underinvestment in IT systems, years of under-spending on computer systems, and the laying off of IT staff as contributing factors. It also highlights the prioritization of investment in more "sexy" or customer-facing technologies over back-office systems as a human decision affecting the reliability of the IT infrastructure [23587].
Dimension (Hardware/Software) hardware, software (a) The article reports that the payment problems at Lloyds Banking Group followed a server failure, which suggests a hardware-level contribution to the incident [23587]. More broadly, it describes the IT systems at high street banks as aging "legacy systems" that are 30-40 years old, were originally set up for branch banking, and have been continuously modified to accommodate new technologies like ATMs, online banking, and mobile banking [23587]. (b) These modifications and bolted-on changes have made the systems more complex, so that they resemble a "house of cards" where a change to a small part of the code can have far-reaching consequences. The article also highlights that new functions are usually written in different programming languages, on different machines, by different teams, which makes it challenging for a single person or team to fully understand the entire structure of the system. This complexity in software development and integration contributes to software failures when changes are made or issues arise, leaving teams scrambling to identify the root cause of the problem [23587].
Objective (Malicious/Non-malicious) non-malicious (a) The article does not mention any malicious intent behind the software failure incident [23587]. (b) The software failure incident is attributed to non-malicious factors such as aging IT systems, legacy systems, continuous bolt-on changes, underinvestment in IT infrastructure, and the complexity arising from multiple programming languages and teams working on different functions of the system [23587]. These factors have made it harder for banks to maintain reliable and robust IT systems, leading to incidents like payment problems, server failures, and ATM disruptions.
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions (a) The article indicates that the software failures in the banking systems were partly due to poor decisions made over the years. There was chronic underinvestment in the systems, with banks laying off IT staff and cutting quality checks due to squeezed budgets [23587]. The legacy systems at banks, some of which are 30-40 years old, were not adequately updated to keep up with evolving technology and regulatory changes, so new functions were added on top of old systems in a complex and interconnected manner [23587]. These decisions regarding underinvestment, lack of comprehensive updates, and reliance on outdated systems contributed to the software failures experienced by the banks. (b) The software failures were also a result of unintended consequences. The systems were described as resembling a "house of cards," where a change to a small piece of code could have far-reaching effects on other parts of the system, leading to failures [23587]. Additionally, the complexity of the systems, with different functions written in different languages by different teams, made it challenging to fully understand the entire structure of the system, causing delays in identifying and fixing problems when they occurred [23587].
Capability (Incompetence/Accidental) development_incompetence (a) The article presents the software failure incident in the banking sector as a result of development incompetence. It mentions that the banks' systems are outdated, with some legacy systems being 30-40 years old and not originally designed to handle modern banking channels like online and mobile banking [23587]. The complexity of the systems has increased over time due to continuous bolt-on changes rather than starting from scratch, so that even small code changes can have widespread impacts on the system, causing failures [23587]. (b) The incident also reflects accidental failures caused by the continuous layering of new systems on top of old ones, an approach that the article says has made breakdowns more frequent. Chronic underinvestment in IT systems and the prioritization of spending on other "sexier" technologies rather than back-office systems have contributed to the accidental failures in the banking IT infrastructure [23587].
Duration permanent, temporary The article discusses software failures that can be categorized as both permanent and temporary: (a) Permanent: The article mentions that the banks' IT systems face ongoing issues due to a combination of factors such as aging legacy systems, continuous bolt-on changes, complex structures, chronic underinvestment, and the challenge of overhauling systems while operations are ongoing [23587]. (b) Temporary: Specific incidents, such as the server failure at Lloyds Banking Group that caused debit card transactions to be declined and ATMs to not dispense cash for around three-and-a-half hours, represent temporary software failures that were resolved within a relatively short timeframe [23587].
Behaviour omission, other (a) crash: The article mentions incidents where customers were cut off from their cash due to server failures, leading to debit card transactions being declined and ATMs not dispensing cash [23587]. (b) omission: The article discusses how some customers of RBS's banking brands went weeks without being able to access their accounts properly due to a "glitch," indicating an omission of the system to provide access to accounts [23587]. (c) timing: The article highlights that bank systems are struggling to keep up with the real-time demands of modern banking, with reconciliation between different banking channels becoming increasingly challenging due to the widening gulf between background processes and user activities [23587]. (d) value: The article makes no specific mention of the system performing its intended functions incorrectly. (e) byzantine: The article describes how changes made to a small part of the code can have far-reaching consequences, causing issues in seemingly unrelated areas of the system, resembling a "house of cards" where a change in one area can lead to failures in another [23587]. (f) other: The article also discusses chronic underinvestment in bank IT systems, leading to quality checks being cut, compounded by changes in regulations and increased capital requirements, which further strain the already complex and aging systems [23587].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, delay, theoretical_consequence (a) death: The article does not mention any deaths resulting from the software failure incident [23587]. (b) harm: The article does not mention physical harm to individuals due to the software failure incident [23587]. (c) basic: The article does not mention people's access to food or shelter being impacted by the software failure incident [23587]. (d) property: The software failure incident impacted people's access to their money, as debit card transactions were declined and ATMs were unable to dispense cash for around three-and-a-half hours [23587]. (e) delay: People had to delay their financial transactions and access to cash due to the software failure incident [23587]. (f) non-human: The article does not specifically mention non-human entities being impacted by the software failure incident [23587]. (g) no_consequence: The software failure incident did have real observed consequences, such as declined transactions and inability to access cash [23587]. (h) theoretical_consequence: The article discusses potential consequences of continued breakdowns and more frequent incidents unless more money is spent on overhauling the systems [23587]. (i) other: The article mentions no other consequences beyond those related to financial transactions and access to cash [23587].
Domain finance (a) The failed system was related to the finance industry, specifically affecting high street banks like Lloyds Banking Group and RBS [23587]. The incident involved server failures leading to payment problems, debit card transactions being declined, and ATMs not dispensing cash, highlighting the strain on the banking systems due to new technologies and regulations. The article discusses how the banking systems have evolved over the years to accommodate internet banking, ATMs, online spending, and mobile banking, leading to complex and interconnected systems that are prone to failures when changes are made [23587]. (h) The software failure incident was directly related to the finance industry, impacting the banking sector and customers' access to their accounts and cash transactions [23587]. The incident highlighted the challenges faced by banks in maintaining and updating their legacy systems, which were originally set up for branch banking but have been continuously modified to incorporate new technologies like ATMs, online banking, and mobile banking [23587]. The article also mentions the chronic underinvestment in IT systems by banks, leading to quality issues and breakdowns in the systems [23587].

