Incident Title: Firmware Bug in VOIP Servers Causes Network Controller Crashes

Published Date: 2013-02-08

Postmortem Analysis
Timeline 1. The incident, in which servers crashed after receiving a single packet of data from Yealink SIP-T22P phones, occurred in the summer and was discovered by September [17085]. 2. The article was published on 2013-02-08, so the incident most likely occurred in the summer of 2012.
System 1. Intel 82574L network controller 2. Lex CompuTech motherboards 3. Yealink SIP-T22P phone model [17085]
Responsible Organization 1. Taiwan's Lex CompuTech caused the software failure incident by using the wrong version of the Electrically Erasable Programmable Read-Only Memory (EEPROM) software for the controller setup on the motherboards it shipped [17085].
Impacted Organization 1. Star2Star company [17085] 2. Customers who received the Linux-based Voice Over Internet Protocol (VOIP) servers from Star2Star [17085]
Software Causes 1. The software cause of the failure incident was a bug in the firmware of the Intel 82574L network controller, specifically in the Electrically Erasable Programmable Read-Only Memory (EEPROM) software used for the controller setup [17085].
Non-software Causes 1. A manufacturing process error: Lex CompuTech programmed an incorrect EEPROM image for the controller setup onto the motherboards it shipped [17085].
Impacts 1. New servers crashed unexpectedly at 10 companies, in some cases every day, disrupting their operations [17085]. 2. Recovering from a crash required manual intervention: each server had to be unplugged, plugged back in, and restarted, which caused downtime [17085]. 3. The bug in the firmware of the motherboards' networking hardware knocked the Intel 82574L network controller offline, cutting the server's Ethernet network connection [17085]. 4. It was unclear how many other systems might be vulnerable to the same bug, which raised concerns about a wider impact [17085]. 5. Malicious actors could potentially exploit the bug to take down entire racks of servers running the faulty firmware, posing a security risk to organizations using the affected hardware [17085].
Preventions 1. Proper testing and validation of the firmware image by the motherboard manufacturer, Lex CompuTech, before shipping the products could have prevented the incident [17085]. 2. Implementing robust quality control measures during the manufacturing process to ensure the correct EEPROM software version is programmed for the controller setup could have helped prevent the bug from occurring [17085]. 3. Regular monitoring and testing of network hardware and firmware for vulnerabilities and unusual behavior could have potentially detected the issue before it caused widespread crashes [17085].
Fixes 1. Updating the firmware image on the affected motherboards with the correct version of Electrically Erasable Programmable Read-Only Memory (EEPROM) software [17085].
References 1. Kristian Kielhofner, the Chief Technology Officer of Star2Star [17085] 2. Intel, the manufacturer of the networking hardware affected by the bug [17085] 3. Lex CompuTech (Synertron Technology in the U.S.), the company responsible for the firmware issue on the motherboards [17085]
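The first two prevention items above amount to checking the EEPROM image before a board ships. As a minimal sketch, assuming the manufacturer keeps a list of approved image digests, such a check might look like the following. The names APPROVED_DIGESTS, image_is_approved, and check_before_shipping, and the placeholder image bytes, are illustrative assumptions and are not described in the article [17085].

```python
import hashlib

# Placeholder bytes standing in for an approved EEPROM image.
# This is NOT a real Intel 82574L image; it exists only so the sketch runs.
KNOWN_GOOD_IMAGE = bytes(range(16)) * 4

# Hypothetical allow-list of SHA-256 digests for images approved for this board.
APPROVED_DIGESTS = frozenset({hashlib.sha256(KNOWN_GOOD_IMAGE).hexdigest()})


def image_is_approved(image: bytes, approved: frozenset = APPROVED_DIGESTS) -> bool:
    """Return True if the image's SHA-256 digest is on the approved list."""
    return hashlib.sha256(image).hexdigest() in approved


def check_before_shipping(image: bytes) -> str:
    """Gate a board on its EEPROM image: 'ship' if approved, otherwise 'hold'."""
    return "ship" if image_is_approved(image) else "hold"
```

On a deployed Linux server, the image bytes might come from an EEPROM dump such as the one `ethtool -e` produces, though the article does not say how the images were actually compared.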

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The software failure incident, in which servers crashed after receiving a single packet of data from Yealink SIP-T22P phones, was attributed to the firmware on motherboards made by Lex CompuTech, as confirmed by Intel [17085]. The incident was isolated to the specific type of motherboard that Kielhofner blogged about, rather than being a widespread issue with Intel's networking controller, which had been shipping for five years. (b) The articles do not provide information about similar incidents happening at other organizations or with their products and services.
Phase (Design/Operation) design, operation (a) The software failure incident described in the article was primarily due to a design issue. The root cause of the problem was traced back to a single packet of data being sent by a specific model of phone, the Yealink SIP-T22P, which was causing the VOIP servers to crash. This issue was further attributed to a firmware bug in the Electrically Erasable Programmable Read-Only Memory (EEPROM) software used by the networking hardware built by Intel but shipped with motherboards manufactured by Lex CompuTech [17085]. (b) The operation of the system did play a role in exacerbating the software failure incident. After each crash caused by the packet of data, the servers had to be manually unplugged, plugged back in, and restarted to get back online. This operational procedure was necessary due to the nature of the failure, where simply restarting the server or turning it off did not resolve the issue. Additionally, the potential impact of the bug on the operation of servers was highlighted, as it could be exploited to take down entire racks of servers that were affected by the buggy firmware [17085].
Boundary (Internal/External) within_system (a) The software failure incident described in the article is primarily within_system. The issue stemmed from a single packet of data sent by a specific model of phone, the Yealink SIP-T22P, which caused the VOIP servers to crash. The problem was traced back to a firmware bug in the Intel 82574L network controller, which was due to an incorrect EEPROM image programmed by the motherboard manufacturer, Lex CompuTech [17085].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident in this case was primarily due to non-human actions. The root cause of the issue was traced back to a single packet of data being sent by a specific model of phone (Yealink SIP-T22P) to the VOIP servers, which caused the servers to crash. This packet of data affected the Intel 82574L network controller on the servers, leading to the crashes [17085]. (b) Human actions were also involved in the software failure incident. The firmware issue that caused the bug was attributed to the company that made the motherboards, Lex CompuTech. They used the wrong version of the Electrically Erasable Programmable Read-Only Memory (EEPROM) software for the controller setup, which ultimately led to the problem. Additionally, the detective work conducted by Kristian Kielhofner to identify the root cause of the crashes involved human effort and expertise [17085].
Dimension (Hardware/Software) hardware (a) The software failure incident in this case was primarily due to hardware issues. The incident was traced back to a bug in the firmware of the Intel 82574L network controller, which was caused by an incorrect EEPROM image programmed by the motherboard manufacturer, Lex CompuTech [17085]. The bug in the hardware caused the servers to crash when receiving a specific packet of data from the Yealink SIP-T22P phone model, leading to network connectivity issues and the need for manual intervention to restart the servers. (b) The software failure incident was not directly attributed to software issues but rather to a hardware bug in the firmware of the network controller. The software running on the servers, including the Linux-based VOIP servers, was functioning as intended until it encountered the problematic packet of data that triggered the hardware failure [17085].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident described in the article is non-malicious. The failure was caused by a bug in the firmware of the networking hardware, specifically the Intel 82574L network controller, due to an incorrect EEPROM image programmed during manufacturing by the motherboard manufacturer, Lex CompuTech [17085]. The incident was not a result of malicious intent but rather a technical flaw in the system.
Intent (Poor/Accidental Decisions) accidental_decisions (a) The software failure incident described in the article was not due to poor decisions but rather an accidental decision or mistake. The root cause of the issue was traced back to a firmware bug in the Electrically Erasable Programmable Read-Only Memory (EEPROM) software that was programmed incorrectly during manufacturing by the motherboard manufacturer, Lex CompuTech. Intel confirmed that the bug was not its fault but was in the firmware that Lex shipped with its motherboards [17085].
Capability (Incompetence/Accidental) accidental (a) The software failure incident described in the article was not due to development incompetence. Instead, it was attributed to a specific bug in the firmware that was shipped with the motherboards by the company Lex CompuTech. Intel confirmed that the issue was related to an incorrect EEPROM image programmed during manufacturing by the motherboard vendor, not a design problem with the Intel networking controller [17085]. (b) The software failure incident was accidental in nature. It was caused by a single packet of data sent by a specific model of phone (Yealink SIP-T22P) that was crashing the VOIP servers. This packet of data was identified as the root cause of the crashes, and it was not intentionally created to cause harm but rather had an unintended impact on the server hardware [17085].
Duration temporary (a) The software failure incident described in the article was temporary. The servers were crashing due to a specific packet of data being sent by a particular model of phone, the Yealink SIP-T22P. This packet caused the servers to crash, specifically targeting the Intel 82574L network controller. Restarting the server or turning it off did not fix the issue, and the only solution was to unplug the server, plug it back in, and then restart the machine to get back on the network [17085]. This indicates that the failure was temporary and could be resolved by taking specific actions to address the root cause of the issue.
Behaviour crash, other (a) crash: The software failure incident described in the article resulted in crashes of the servers running the VOIP systems. The crashes were caused by a specific packet of data sent by a particular model of phone, which led to the system's Intel network controller being knocked offline and requiring manual intervention to restart the servers [17085]. (b) omission: There is no specific mention of the software failure incident being related to the system omitting to perform its intended functions at an instance(s) in the provided article. (c) timing: The software failure incident did not involve the system performing its intended functions too late or too early; rather, it led to unexpected crashes of the servers [17085]. (d) value: The failure was not related to the system performing its intended functions incorrectly; instead, it was about the system crashing due to a specific packet of data causing issues with the network controller [17085]. (e) byzantine: The software failure incident did not exhibit behavior of the system behaving erroneously with inconsistent responses and interactions; it was more focused on the specific cause of the crashes related to the network controller issue [17085]. (f) other: The other behavior exhibited in this software failure incident was the unique nature of the bug causing the crashes. The specific packet of data sent by a particular model of phone had a byte value that, when set to certain numbers, would either crash the controller or inoculate it against further packets of death until the server was powered off. This behavior was described as unusual and not commonly encountered in network troubleshooting [17085].
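The behaviour described under (f), together with the recovery steps in the Duration row, can be read as a small state machine. The sketch below is a toy model under stated assumptions: TRIGGER_OFFSET, CRASH_VALUES, and INOCULATE_VALUES are placeholder values chosen for illustration (the article summary gives no real offset or byte values), and ControllerModel is a hypothetical name, not the controller's actual logic.

```python
from enum import Enum

# Placeholder values for illustration only; the source does not state the
# real byte offset or the real trigger values.
TRIGGER_OFFSET = 8
CRASH_VALUES = frozenset({0x01})
INOCULATE_VALUES = frozenset({0x02})


class NicState(Enum):
    HEALTHY = "healthy"
    INOCULATED = "inoculated"
    OFFLINE = "offline"


class ControllerModel:
    """Toy model of the network controller behaviour described in the rationale."""

    def __init__(self) -> None:
        self.state = NicState.HEALTHY

    def receive(self, packet: bytes) -> NicState:
        # An offline controller processes nothing, and a packet too short to
        # contain the trigger byte has no special effect.
        if self.state is NicState.OFFLINE or len(packet) <= TRIGGER_OFFSET:
            return self.state
        value = packet[TRIGGER_OFFSET]
        # Only a healthy controller changes state; an inoculated one ignores
        # further "packets of death".
        if self.state is NicState.HEALTHY:
            if value in CRASH_VALUES:
                self.state = NicState.OFFLINE
            elif value in INOCULATE_VALUES:
                self.state = NicState.INOCULATED
        return self.state

    def reboot(self) -> NicState:
        # A restart alone did not bring the controller back. Assumption: it
        # leaves an inoculated controller unchanged as well.
        return self.state

    def power_cycle(self) -> NicState:
        # Unplugging and replugging restores the controller and, since
        # inoculation lasts only until power-off, clears that state too.
        self.state = NicState.HEALTHY
        return self.state
```

In this model, a crash packet followed by reboot() leaves the controller OFFLINE, and only power_cycle() returns it to HEALTHY, which mirrors the manual unplug-and-replug recovery the article describes.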

IoT System Layer

Layer Option Rationale
Perception network_communication, embedded_software The software failure incident described in the article was related to the network communication layer of the cyber physical system that failed. The failure was specifically attributed to a bug in the firmware of the Intel 82574L network controller, which was caused by an incorrect EEPROM image programmed by the motherboard manufacturer, Lex CompuTech [17085]. This bug led to the crashing of the VOIP servers due to a single packet of data sent by a specific model of phone, the Yealink SIP-T22P, which clobbered the network controller and required manual intervention to restart the servers [17085]. The issue was isolated to a specific motherboard design and was not a design problem with the Intel 82574L Gigabit Ethernet controller itself [17085].
Communication link_level The software failure incident described in the article [17085] was related to the communication layer of the cyber physical system that failed at the link_level. The failure was caused by a specific packet of data sent by the Yealink SIP-T22P phone model, which resulted in crashing the servers' Intel 82574L network controller. This issue was traced back to a firmware bug in the Electrically Erasable Programmable Read-Only Memory (EEPROM) software used by the motherboard manufacturer, Lex CompuTech. The bug was not in the Intel networking controller itself but in the firmware provided by Lex, indicating a failure at the link_level of the communication layer.
Application FALSE The software failure incident described in the article [17085] was not related to the application layer of the cyber physical system. Instead, it was attributed to a specific packet of data sent by a particular model of phone that caused the VOIP servers to crash due to a firmware issue in the motherboard's networking hardware. This issue was traced back to an incorrect EEPROM image programmed during manufacturing by the motherboard vendor, not due to bugs, operating system errors, unhandled exceptions, or incorrect usage at the application layer.

Other Details

Category Option Rationale
Consequence property, non-human, theoretical_consequence The consequence of the software failure incident described in the article [17085] was primarily related to property damage. The software bug caused the servers to crash, impacting the Intel 82574L network controller on the affected servers. This resulted in the need for manual intervention to unplug and restart the servers after each crash, affecting the functionality of the affected systems. Additionally, there was a potential theoretical consequence discussed where the bug could be exploited to take down an entire rack of servers, although there were no reported instances of this occurring.
Domain health The software failure incident described in the article was related to the telecommunications industry, specifically the Voice Over Internet Protocol (VOIP) sector. The failed system involved Linux-based VOIP servers that were shipped to various types of businesses such as doctors' offices, financial companies, and daycares [17085].

Sources
