Incident: Cloud Server Firmware Vulnerability in IBM's Bare Metal Service

Published Date: 2019-02-26

Postmortem Analysis
Timeline 1. The software failure incident reported in Article 81120 happened on February 26, 2019.
System 1. IBM bare metal cloud servers' firmware [81120]
Responsible Organization 1. Researchers at the security firm Eclypsium [81120]
Impacted Organization 1. IBM [81120]
Software Causes 1. The software cause of the failure incident was the ability of hackers to alter the firmware of cloud servers, specifically in IBM's bare metal cloud computing service, allowing them to hide malicious code that persists even after the server is released back into the pool of available machines [81120].
Non-software Causes 1. Lack of proper sanitization of cloud servers' firmware by cloud service providers [81120]
Impacts 1. The software failure incident allowed attackers to alter the firmware of cloud servers, potentially planting malware that could spy on the server, alter its data, or destroy it at will [81120]. 2. The incident raised concerns about the security of bare metal cloud servers, where attackers could gain dangerous levels of access to components and persistently infect firmware, evading detection even after a complete wipe of the server's storage [81120]. 3. IBM responded to the incident by re-flashing all BMC firmware before reprovisioning servers to customers, erasing logs and regenerating passwords to mitigate the vulnerability [81120]. 4. Despite IBM's response, researchers were skeptical about the effectiveness of the fix, with concerns that the firmware could still be altered to give hackers control and deceive administrators during re-flashing [81120].
Preventions 1. Ensuring thorough sanitization of equipment at the deepest level, including firmware, by cloud service providers before customer use could have prevented the incident [81120]. 2. Implementing regular checks and updates on firmware integrity, potentially through hardware checks, could have helped prevent the persistence of malicious firmware alterations [81120].
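The firmware integrity check named in Prevention 2 can be illustrated with a minimal sketch, assuming a trusted reference digest of the factory image is available. The function names and the sample byte strings below are hypothetical and do not come from the article. As the researchers quoted in the article caution, compromised firmware could misreport its own contents, so a check like this is only meaningful when the image is read by something the firmware cannot control, such as dedicated hardware.

```python
import hashlib
import hmac


def firmware_digest(image: bytes) -> str:
    """Return the SHA-256 hex digest of a firmware image."""
    return hashlib.sha256(image).hexdigest()


def firmware_is_intact(image: bytes, known_good_digest: str) -> bool:
    """Compare the image's digest to a trusted reference in constant time."""
    return hmac.compare_digest(firmware_digest(image), known_good_digest)


# Hypothetical data: stand-ins for a factory image and a dump read back from a BMC.
factory_image = b"factory BMC firmware v1"
tampered_image = factory_image + b" plus a hidden change"
reference = firmware_digest(factory_image)

print(firmware_is_intact(factory_image, reference))
print(firmware_is_intact(tampered_image, reference))
```

The first call reports the unmodified image as intact and the second flags the altered one, since any change to the bytes changes the digest.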
Fixes 1. IBM has responded to the vulnerability by forcing all BMCs to be reflashed with factory firmware before reprovisioning to other customers [81120]. 2. Adding a piece of hardware to the server to check the firmware's integrity could fully solve the problem according to firmware hacker H. D. Moore [81120].
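The reprovisioning steps the article attributes to IBM (re-flash BMC firmware, erase logs, regenerate passwords) can be sketched as an ordered routine. This is only an illustrative model, not IBM's implementation: `BmcState`, `sanitize_for_next_customer`, and the sample values are hypothetical, and a real BMC would be reached through vendor tooling that the article does not describe.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class BmcState:
    """Hypothetical in-memory stand-in for one server's BMC."""
    firmware: bytes
    logs: list[str] = field(default_factory=list)
    password: str = ""


def sanitize_for_next_customer(bmc: BmcState, factory_firmware: bytes) -> BmcState:
    """Apply the three steps named in the article, in order."""
    bmc.firmware = factory_firmware            # re-flash with factory firmware
    bmc.logs.clear()                           # erase logs left by the previous tenant
    bmc.password = secrets.token_urlsafe(16)   # regenerate the BMC password
    return bmc


# Hypothetical example: a server returned to the pool with modified firmware.
FACTORY = b"factory BMC firmware"
used = BmcState(
    firmware=b"tenant-modified firmware",
    logs=["login from tenant A"],
    password="old-secret",
)
clean = sanitize_for_next_customer(used, FACTORY)
print(clean.firmware == FACTORY, clean.logs, clean.password != "old-secret")
```

The sketch assumes the re-flash itself can be trusted. The researchers' concern recorded above is that altered firmware could interfere with that step, which is why an external hardware check is suggested as the more complete fix.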
References 1. Eclypsium's researchers 2. Yuriy Bulygin, Eclypsium's founder and former head of Intel's advanced threat research team 3. IBM spokesperson 4. Karsten Nohl, security researcher who developed the BadUSB attack 5. H. D. Moore, well-known firmware hacker 6. IBM (mentioned in the article) 7. Security Research Labs (mentioned in the article)

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The incident is classified as recurring within one organization, IBM. Researchers at the security firm Eclypsium demonstrated how they could rent a server from a cloud computing provider, specifically IBM, and alter its firmware so that hidden changes persist even after the server is released and rented by another customer [81120]. (b) The article does not explicitly mention firmware alterations of this kind occurring at multiple organizations.
Phase (Design/Operation) design, operation (a) The article discusses a software failure incident related to the design phase, specifically focusing on the firmware of cloud servers. Researchers at the security firm Eclypsium demonstrated how they could alter the firmware of IBM cloud servers, hiding changes to the code that persisted even after the server was released and rented by another customer [81120]. (b) The article also touches upon a software failure incident related to the operation phase. It mentions that IBM responded to the vulnerability by wiping its servers' BMC firmware between different customers' uses to address the issue. However, there were doubts about the effectiveness of this fix, with researchers still being able to perform their catch-and-release trick despite IBM's claimed actions [81120].
Boundary (Internal/External) within_system, outside_system (a) The software failure incident described in the article is within_system. The failure occurred due to the ability of attackers to alter the firmware of cloud servers, specifically in the case of IBM's bare metal cloud computing service. The incident involved researchers demonstrating how they could make changes to the firmware of servers, which persisted even after the servers were released back into the pool of available machines for other customers to rent [81120].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident in the article is related to non-human actions: once firmware in a cloud server has been altered, the changes persist without any further human intervention. Researchers at Eclypsium demonstrated how they could rent a server from a cloud computing provider, such as IBM, and alter its firmware, hiding changes to its code that persist even after the server is released and rented by another customer [81120]. This vulnerability in the servers' firmware could lead to malicious activities such as spying, data alteration, or destruction of the server. (b) The software failure incident is also related to human actions. While the vulnerability itself may have been introduced without direct human participation, the researchers at Eclypsium actively demonstrated the exploit by making benign alterations to the firmware of IBM servers in their testing [81120]. Additionally, IBM responded to the research findings with a fix that involved wiping and reflashing the servers' BMC firmware between different customers' uses, a human-initiated response to the issue.
Dimension (Hardware/Software) hardware, software (a) hardware: The incident described in the article [81120] is related to hardware. The failure stems from vulnerabilities in the firmware of a hardware component, the baseboard management controller (BMC), in Super Micro servers offered by IBM in its bare metal cloud computing service. Compromised BMC firmware allows malicious code to persist even after the server is released back into the pool of available machines for other customers. This poses a significant risk, as it could let hackers spy on the server, alter its data, or destroy it. (b) software: The incident described in the article [81120] is also related to software. Although the affected component is hardware (the BMC), what is compromised is its firmware, and the persistence of malicious code there is a software failure. It allows unauthorized access, manipulation of the operating system, and potential ransomware attacks, all of which are software-related consequences.
Objective (Malicious/Non-malicious) malicious, non-malicious (a) The software failure incident described in the article is malicious in nature. Researchers demonstrated how they could rent a server from a cloud computing provider, alter its firmware, and hide changes to its code that persist even after they stop renting it. This technique could be used to plant malware in servers' hidden code, allowing hackers to spy on the server, alter its data, or destroy it at will [81120]. The researchers made only benign changes to the IBM servers' firmware in their demonstration, but they warned that the same technique could be used for malicious purposes [81120]. (b) The incident is non-malicious in the sense that the researchers made only harmless alterations to the BMC firmware of the IBM bare metal cloud servers as part of their experiment [81120]. The concern raised, however, was that truly malicious firmware could easily be hidden using the same method [81120].
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions The software failure incident described in the article can be categorized under both poor_decisions and accidental_decisions: (a) poor_decisions: The incident involved a vulnerability through which researchers demonstrated the ability to alter firmware in cloud servers, potentially allowing malware to be planted that persists undetected even after the server is used by another customer [81120]. (b) accidental_decisions: The vulnerability in the firmware of the cloud servers was not intentional but resulted from a security flaw in the design and management of the servers, which allowed firmware alterations to persist [81120].
Capability (Incompetence/Accidental) development_incompetence, accidental (a) The article discusses a software failure incident related to development incompetence. The incident involved a vulnerability in the firmware of IBM's bare metal cloud servers, where researchers at Eclypsium demonstrated how they could alter the firmware of rented servers, allowing for the persistence of hidden malicious code even after the server is released back into the pool of available machines for other customers [81120]. This incident highlights a security flaw that was not addressed initially, showcasing a failure due to contributing factors introduced by a lack of professional competence in ensuring the security and integrity of the cloud servers' firmware. (b) The software failure incident can also be categorized as accidental. The vulnerability in the firmware of the cloud servers was not intentionally created by the cloud service provider or the customers renting the servers. It was a result of a flaw in the design or implementation of the servers' firmware that allowed for unauthorized alterations to persist even after the servers were released and rented by other customers [81120]. This accidental introduction of a security vulnerability showcases a failure due to contributing factors introduced accidentally, rather than through deliberate actions.
Duration temporary The software failure incident described in the article is more aligned with a temporary failure rather than a permanent one. The incident involved researchers at the security firm Eclypsium demonstrating how they could alter the firmware of cloud computing servers, specifically focusing on IBM servers, in a way that allowed changes to persist even after the server was released and rented by another customer [81120]. This indicates that the failure was temporary in nature, as it was caused by specific actions taken by the researchers rather than being an inherent, permanent flaw in the system.
Behaviour crash, other (a) crash: The article discusses a scenario where researchers demonstrated the ability to alter the firmware of cloud servers, which could potentially lead to malware being planted in the servers' hidden code. This alteration persists even after the server is released back into the pool of available machines for other customers, indicating a form of system crash where the system loses its state and may not perform its intended functions properly [81120]. (b) omission: The incident described in the article does not directly mention a failure due to the system omitting to perform its intended functions at an instance(s). (c) timing: The article does not specifically mention a failure due to the system performing its intended functions correctly but too late or too early. (d) value: The article does not provide information about a failure due to the system performing its intended functions incorrectly. (e) byzantine: The incident described in the article does not directly mention a failure due to the system behaving erroneously with inconsistent responses and interactions. (f) other: The behavior of the software failure incident described in the article can be categorized as a persistent firmware alteration that occurs even after the server is released back into the pool of available machines for other customers. This behavior could be considered a form of system persistence or system state retention beyond the expected or intended duration, which is not explicitly covered by the options (a) to (e) [81120].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property, non-human, theoretical_consequence (a) death: There is no mention of people losing their lives due to the software failure incident in the provided article [81120]. (b) harm: The article does not mention people being physically harmed due to the software failure incident [81120]. (c) basic: The article does not mention people's access to food or shelter being impacted because of the software failure incident [81120]. (d) property: The software failure incident impacted people's material goods, money, or data as it involved the potential risk of planting malware in servers' hidden code that could allow hackers to spy on the server, alter its data, or destroy it at will [81120]. (e) delay: The article does not mention people having to postpone an activity due to the software failure incident [81120]. (f) non-human: Non-human entities were impacted due to the software failure incident as it involved altering the firmware of cloud computing servers, specifically the baseboard management controller, which could lead to bricking computers or paralyzing them for a potential ransomware attack [81120]. (g) no_consequence: The article does not mention there were no real observed consequences of the software failure incident [81120]. (h) theoretical_consequence: The article discusses potential consequences of the software failure incident, such as the ability to hide truly malicious firmware in servers' hidden code that persists undetected even after someone else takes over the machine [81120]. (i) other: The article does not mention any other specific consequences of the software failure incident beyond those discussed in the options (a) to (h) [81120].
Domain information (a) The failed system in the article was related to the information industry, specifically cloud computing servers used by various organizations for different purposes such as video conference hosting, mobile payments, and neurological stimulation treatments [81120].
