Incident: Zoom Video Communications Inc. Experiences Widespread Outages Impacting Users

Published Date: 2020-05-17

Postmortem Analysis
Timeline 1. The incident reported in Article 99927 occurred on 2020-05-17. 2. The incident reported in Article 103621 occurred on 2020-08-24.
System 1. Zoom Video Communications platform [99927, 103621] 2. Zoom's website for signing up for paid accounts, upgrading, and managing services [103621]
Responsible Organization 1. Zoom Video Communications Inc. [Article 103621] 2. Unknown
Impacted Organization 1. Zoom users, including individuals unable to log in, host, or participate in meetings [99927] 2. Journalists who were unable to join live on-screen during a press conference due to Zoom issues [99927] 3. Users who couldn't sign up for paid accounts, upgrade, or manage their service on Zoom's website [103621]
Software Causes 1. An issue affecting a subset of Zoom users prevented them from hosting and joining Zoom Meetings and Zoom Video Webinars [99927]. 2. Technical difficulties disrupted webinars and online meetings and prevented users from signing up for paid accounts, upgrading, or managing their service on Zoom's website. Zoom did not disclose the specific cause [103621].
Non-software Causes 1. Increased usage due to the surge in members amid the coronavirus pandemic [99927] 2. Vulnerabilities of virtual school during the coronavirus pandemic [103621]
Impacts 1. Users of the Zoom video conferencing service experienced audio and visual problems, with some unable to log in, host, or participate in meetings [99927]. 2. Around 3043 people reported problems with the service, mainly related to video conferencing and logging in [99927]. 3. Webinars and other online meetings were disrupted or inaccessible until around 1 p.m. in New York, impacting work meetings and online learning [103621]. 4. Users couldn't sign up for paid accounts, upgrade, or manage their service on Zoom's website [103621]. 5. The incident highlighted the vulnerabilities of virtual school during the coronavirus pandemic, affecting the ability of children to learn and employees to work [103621].
Preventions 1. Implementing robust testing procedures to identify and address potential issues before they impact users [99927]. 2. Enhancing server capacity and redundancy to handle increased usage during peak times, such as the start of the school day [103621].
Fixes 1. Implementing robust monitoring systems to quickly identify and address any issues that arise [99927]. 2. Conducting a thorough investigation to determine the root cause of the technical difficulties [103621]. 3. Enhancing the scalability and reliability of the platform to handle increased usage during peak times, such as the start of the school day [103621].
References 1. DownDetector [Article 99927] 2. Twitter users [Article 99927] 3. Zoom spokesperson [Article 99927] 4. Business Secretary Alok Sharma [Article 99927] 5. Zoom's tweet [Article 103621]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) one_organization: Zoom, an American communications company based in San Jose, California, experienced service disruptions on more than one occasion, as reported in Article 99927 and Article 103621. In both instances users had trouble logging in, hosting, or participating in meetings, and the service was inaccessible for a period of time. (b) multiple_organization: The articles do not mention similar incidents at other organizations or with their products and services.
Phase (Design/Operation) design, operation (a) design: Article 103621 reports that users couldn't sign up for paid accounts, upgrade, or manage their service on Zoom's website, which suggests a fault introduced during system development or updates [103621]. (b) operation: Article 99927 describes Zoom users being unable to log in, host, or participate in meetings when the platform went down, a failure that arose while the system was in operation [99927].
Boundary (Internal/External) within_system (a) within_system: The failures originated within Zoom's own platform and services. Users were unable to log in, host, or participate in meetings; webinars and online meetings were disrupted; and users could not sign up for paid accounts or manage their service on Zoom's website [99927, 103621]. (b) outside_system: The articles give no indication that factors originating outside the system contributed to the incidents.
Nature (Human/Non-human) non-human_actions (a) The software failure incident occurring due to non-human actions: - The incident on Zoom was due to technical difficulties that disrupted webinars and online meetings, making them inaccessible until around 1 p.m. [Article 103621]. - Users also couldn't sign up for paid accounts, upgrade, or manage their service on Zoom's website during the outage [Article 103621]. (b) The software failure incident occurring due to human actions: - There is no specific mention in the articles about the software failure incident being caused by human actions.
Dimension (Hardware/Software) software (a) The articles do not provide any information indicating that the software failure incident was due to hardware issues. Therefore, it is unknown if hardware played a role in the incident. (b) The software failure incidents reported in the articles were due to issues originating in the software. The incidents included users being unable to log in, host, or participate in meetings, disruptions in webinars and online meetings, and users not being able to sign up for paid accounts or manage their service on Zoom's website. These issues were related to the software functionality and performance, leading to disruptions in service [99927, 103621].
Objective (Malicious/Non-malicious) non-malicious (a) The articles give no indication of malicious intent. They describe technical difficulties, outages, and disruptions in the Zoom service that affected users' ability to host or join meetings, sign up for accounts, and manage services [99927, 103621], which points to a non-malicious failure rather than intentional harm to the system.
Intent (Poor/Accidental Decisions) poor_decisions (a) The software failure incident related to Zoom experiencing outages and disruptions in its video conferencing service can be attributed to poor decisions. The incident was a result of technical difficulties that affected webinars, online meetings, and the ability for users to sign up for paid accounts or manage their services on Zoom's website [Article 103621]. The company did not disclose the cause of the technical difficulties, indicating a lack of transparency and potentially poor decision-making in communication with users about the issue. Additionally, the incident occurred on the first day of class for many U.S. students, highlighting the significant impact of the outage on virtual learning during the coronavirus pandemic [Article 103621].
Capability (Incompetence/Accidental) accidental (a) The software failure incident reported in the articles does not explicitly mention any issues related to development incompetence. The incidents seem to be more related to technical difficulties or outages rather than issues stemming from lack of professional competence in development. (b) The software failure incidents reported in the articles were due to technical difficulties or outages that occurred accidentally. For example, in Article 103621, it is mentioned that webinars and online meetings were disrupted or inaccessible, and users couldn't sign up for paid accounts or manage their service on Zoom's website. The cause of the technical difficulties was not disclosed, indicating that the incident was accidental in nature.
Duration temporary (a) Both incidents were temporary: each disrupted service for a limited period before being resolved. Article 99927 reports that Zoom users began experiencing issues with the video service around 9 a.m. and that the company later confirmed the matter was resolved. Article 103621 states that webinars and online meetings were disrupted until around 1 p.m., after which services were restored [99927, 103621].
Behaviour crash, omission (a) crash: The software failure incident in Article 99927 can be categorized as a crash. Users of the Zoom video conferencing service were unable to log in, host, or participate in meetings, indicating that the system stopped performing its intended functions [99927]. (b) omission: The software failure incident in Article 103621 can be categorized as an omission. Users were unable to sign up for paid accounts, upgrade, or manage their service on Zoom's website, indicating that the system omitted some of its intended functions [103621]. (c) timing: The articles do not provide information indicating a timing-related failure. (d) value: The articles do not provide information indicating a value-related failure. (e) byzantine: The articles do not provide information indicating a byzantine-related failure. (f) other: No other behaviour is indicated; the incidents are covered by the crash and omission categories above [99927, 103621].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay (e) delay: People had to postpone an activity due to the software failure. The incidents mainly caused delays in online meetings, webinars, and online learning activities. The articles do not mention any consequences such as death, physical harm, impact on access to food or shelter, impact on material goods or data, or harm to non-human entities due to the Zoom software failures [99927, 103621].
Domain information, health (a) The failed system was related to the information industry as it affected the video conferencing service provided by Zoom, which is used for virtual meetings, online learning, and communication purposes [99927, 103621]. (j) The health industry was also impacted by the software failure incident as Zoom is widely used for telemedicine, virtual consultations, and remote healthcare services, especially during the coronavirus pandemic [99927, 103621]. (m) The software failure incident is not related to any other industry mentioned in the options provided.

Sources
