Incident: Redfin Website Outage: Master Database Crash Impacting Home Shoppers

Published Date: 2012-08-18

Postmortem Analysis
Timeline 1. The incident happened on a Saturday morning [13802]. 2. Article [13802] was published on 2012-08-18, which was itself a Saturday. 3. Estimated date: the article gives no explicit date, but since the publication date falls on a Saturday, the incident most likely occurred on 2012-08-18.
System 1. Master database machine, which crashed [13802]. 2. Fallback to a different master, which failed [13802]. 3. Backup procedure used next, which was more time-consuming [13802].
Responsible Organization 1. Redfin: the failure arose in its own database infrastructure, where the master database machine crashed and the fallback to a different master failed [13802].
Impacted Organization 1. Redfin [13802]
Software Causes 1. The fallback to a different master database failed, so the crash of one machine became a full site outage [13802]. 2. The backup procedure used next was more time-consuming, which lengthened recovery [13802].
Non-software Causes 1. The master database machine crashed [13802]. The article does not say whether the crash was a hardware or a software fault, and Redfin stated that it did not yet know what caused the initial problem [13802].
Impacts 1. Redfin's website was down for about four hours on a Saturday morning, from 4:15 a.m. to about 8:15 a.m. PT, leaving weekend home shoppers unable to use the site [13802].
Preventions 1. A more robust failover system, so that the fallback to a different master succeeds when the master database machine crashes [13802].
Fixes 1. A faster backup procedure, so that recovery is quicker when the fallback fails after a database crash [13802]. 2. An investigation into the root cause of the initial crash, which Redfin said it was still looking into [13802].
References 1. Redfin's tweet explaining the issue with the master database [13802] 2. Statement sent by Redfin to CNET regarding the incident [13802]
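The Preventions entry above depends on a fallback that is known to work before it is needed. The following is a minimal sketch of that idea, not a description of Redfin's setup, which the article does not detail. The `Node` class, the `is_healthy` probe, and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass
class Node:
    """Hypothetical database node; `is_healthy` stands in for a real health probe."""
    name: str
    is_healthy: Callable[[], bool]


def choose_master(candidates: Sequence[Node]) -> Optional[Node]:
    """Return the first candidate whose health probe passes, in priority order.

    Returns None when every candidate fails, which is the case that forces a
    slower recovery path such as restoring from backup.
    """
    for node in candidates:
        try:
            if node.is_healthy():
                return node
        except Exception:
            # A probe that raises is treated the same as an unhealthy node.
            continue
    return None


def failover_drill(candidates: Sequence[Node]) -> List[str]:
    """Name the standby nodes that could NOT take over if the master crashed.

    Assumption: the first entry is the current master, the rest are standbys.
    """
    return [n.name for n in candidates[1:] if choose_master([n]) is None]
```

Running `failover_drill` on a schedule would surface a broken standby while the master is still up, instead of during an outage.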

Software Taxonomy of Faults

Category Option Rationale
Recurring unknown (a) The article mentions no similar earlier incident at Redfin or with its products and services. Redfin's statement that it did not yet know what caused the initial problem and was looking into it gives no indication of a recurrence [13802]. (b) The article gives no information about similar incidents at other organizations or with their products and services, so it is unknown whether this type of failure has occurred elsewhere.
Phase (Design/Operation) design (a) The incident is classified under the design phase. The article states that the site's recent redesign was not the cause; the relevant design factor is the failover arrangement, since the fallback to a different master failed when the master database machine crashed [13802]. (b) The article does not attribute the failure to the operation or misuse of the system.
Boundary (Internal/External) within_system (a) within_system: The failure originated within the system. The master database machine crashed, the fallback to a different master failed, and the backup procedure that followed was more time-consuming; the site was down for about four hours [13802].
Nature (Human/Non-human) non-human_actions (a) The incident is attributed to non-human actions. The master database machine crashed and the fallback to a different master failed, taking the site down for about four hours. The cause was a technical failure in the database system, not the new page design [13802].
Dimension (Hardware/Software) hardware (a) The incident is classified as hardware because it began with the crash of the master database machine. The article does not state what caused the crash, so this classification is tentative. The fallback to a different master then failed, and the site was down for about four hours [13802].
Objective (Malicious/Non-malicious) non-malicious (a) The incident was non-malicious. The site went down for several hours on a Saturday morning because the master database machine crashed and the fallback to a different master failed; the new page design was not involved. The article reports no deliberate attempt to harm the system, and Redfin said it was investigating the cause of the initial problem [13802].
Intent (Poor/Accidental Decisions) accidental_decisions (a) The article attributes the incident to an unexpected technical failure, not to poor decisions: the master database machine crashed, the fallback to a different master failed, and the backup procedure that followed was more time-consuming. Redfin said it was still investigating the root cause [13802].
Capability (Incompetence/Accidental) accidental (a) The article does not attribute the incident to development incompetence; it describes a technical failure in which the master database machine crashed and the fallback to a different master failed [13802]. (b) The incident is categorized as accidental. Redfin said it did not yet know the initial cause of the problem and was investigating it [13802].
Duration temporary The incident was temporary. The site was down for about four hours, from 4:15 a.m. to about 8:15 a.m. PT, after the master database machine crashed and the fallback to a different master failed. The backup procedure used next was more time-consuming, which accounts for the length of the outage, but service was restored the same morning [13802].
Behaviour crash, other (a) crash: The master database machine crashed, and the site was down for about four hours [13802]. (b) omission: The article does not describe the system omitting its intended functions at particular instances. (c) timing: The article does not describe the system performing its intended functions too early or too late [13802]. (d) value: The article does not describe the system performing its intended functions incorrectly [13802]. (e) byzantine: The article does not describe inconsistent responses or interactions [13802]. (f) other: Beyond the crash itself, the fallback to a different master failed and the backup procedure that followed was time-consuming [13802].
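The outage window in the Duration row and the Saturday assumption in the Timeline can be checked with a few lines of date arithmetic. This sketch assumes the outage started and ended on the article's publication date, 2012-08-18, as the Timeline estimates; the variable names are illustrative.

```python
from datetime import datetime

# Assumption: the outage fell on the article's publication date (2012-08-18).
# Both times are Pacific Time per the article, so no timezone conversion is needed.
outage_start = datetime(2012, 8, 18, 4, 15)
outage_end = datetime(2012, 8, 18, 8, 15)

# Length of the outage window in hours.
outage_hours = (outage_end - outage_start).total_seconds() / 3600

# datetime.weekday() counts from Monday = 0, so Saturday is 5.
is_saturday = outage_start.weekday() == 5

print(f"Outage lasted about {outage_hours:.0f} hours; on a Saturday: {is_saturday}")
```

The end time is approximate in the source ("about 8:15 a.m."), so the four-hour figure is an estimate.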

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay The consequence of the incident was a delay. The Redfin website was down for about four hours on a Saturday morning, and weekend home shoppers could not access the site during that time [13802].
Domain unknown (a) The option is recorded as unknown, although the article places the failed system in the real estate industry, specifically online real estate brokerage: Redfin's website went down for several hours on a Saturday morning [13802].
