Incident: jQuery Upgrade Causing Commenting Functionality Failure on Guardian.co.uk

Published Date: 2015-08-16

Postmortem Analysis
Timeline 1. The incident occurred on a Thursday morning, between 9.30am and around noon [9155]. Estimation: the article was published on 2015-08-16, so the incident likely occurred on a Thursday morning in August 2015.
System 1. jQuery version 1.6.4: the upgrade to this version changed the behavior of the .attr() function, producing unexpected results in existing code [9155]. 2. Discussion platform: the version of the Discussion platform running on the Release environment did not match the version running on Production, so the issue was not detected during testing [9155].
Responsible Organization 1. The team that deployed release 125 to production without thoroughly testing the impact of the jQuery upgrade [9155].
Impacted Organization 1. Users trying to comment on guardian.co.uk [9155]
Software Causes 1. Upgrading jQuery from 1.4.3 to 1.6.4 changed the behavior of the .attr() function, specifically how it handles the disabled attribute on HTML elements [9155]. 2. The code had not been updated to use the more appropriate call $(element).removeAttr("disabled") in place of $(element).attr("disabled", ""), so after the upgrade the Submit button was no longer re-enabled [9155]. 3. Because of timing constraints, the new Discussion features were scheduled for deployment after the jQuery upgrade rather than before it, so the upgrade reached production before the affected code had been fixed [9155].
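The behavioral difference described above can be illustrated with a minimal sketch. It does not use jQuery or a real DOM: FakeElement, setAttr and removeAttr are hypothetical stand-ins that model only the HTML rule that a boolean attribute such as disabled counts as set whenever it is present, even with an empty value. Assuming the upgraded .attr() writes the attribute in this way, setting it to "" leaves the button disabled, while removing it re-enables the button.

```javascript
// Hypothetical model of a DOM element; not jQuery and not a real DOM.
class FakeElement {
  constructor() {
    this.attributes = new Map();
  }
  // HTML boolean attribute rule: presence means true, whatever the value.
  get disabled() {
    return this.attributes.has("disabled");
  }
}

// Stand-in for an attribute-writing call such as $(element).attr(name, value).
function setAttr(element, name, value) {
  element.attributes.set(name, String(value));
}

// Stand-in for $(element).removeAttr(name).
function removeAttr(element, name) {
  element.attributes.delete(name);
}

const submitButton = new FakeElement();
setAttr(submitButton, "disabled", "disabled"); // the button starts disabled

setAttr(submitButton, "disabled", ""); // old "enable" idiom
const disabledAfterEmptyValue = submitButton.disabled; // attribute still present

removeAttr(submitButton, "disabled"); // the call recommended in the report
const disabledAfterRemoval = submitButton.disabled; // attribute gone

console.log({ disabledAfterEmptyValue, disabledAfterRemoval });
```

In this model the empty-string write never enables the button, which matches the symptom reported in the incident: a Submit button that could not be clicked.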
Non-software Causes 1. Lack of synchronization between the Discussion platform versions in the test environment and the production environment [9155]. 2. Timing constraints leading to a rushed deployment schedule [9155].
Impacts 1. Commenting was disabled on guardian.co.uk between 9.30am and around noon on Thursday, impacting user engagement and interaction [9155]. 2. The Submit button in the commenting form was impossible to click, hindering users from posting comments [9155]. 3. The software failure incident led to a delay in introducing new functionality as the issue needed to be fixed before proceeding with the deployment [9155]. 4. The upgrade to jQuery caused unexpected behavior in the code, affecting the functionality of the commenting platform [9155]. 5. The failure to spot the issue during regression testing highlighted a gap in testing environments and procedures, leading to a missed opportunity to catch the bug before deployment [9155].
Preventions 1. Conduct thorough regression testing on the exact version of software that will be deployed to production to catch any unexpected behavior or bugs [9155]. 2. Ensure that the test environment closely mirrors the production environment to minimize the risk of environment-specific failures not being found during testing [9155]. 3. Implement proper code reviews and testing procedures to catch potential issues with software upgrades, such as changes in third-party libraries like jQuery [9155].
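The environment-parity check in the second prevention can be partly automated. The sketch below assumes each environment can report its deployed component versions as a plain object. The findVersionMismatches helper, the component names, and the "old"/"new" Discussion labels are hypothetical; only the jQuery version numbers come from the report. The idea is to refuse sign-off on regression testing when Release and Production differ in anything other than the change under test.

```javascript
// Hypothetical helper: lists components whose versions differ between two
// environments, skipping components that are intentionally being changed.
function findVersionMismatches(release, production, underTest = []) {
  const names = new Set([...Object.keys(release), ...Object.keys(production)]);
  const mismatches = [];
  for (const name of names) {
    if (underTest.includes(name)) continue; // expected to differ
    if (release[name] !== production[name]) {
      mismatches.push({
        name,
        release: release[name] ?? null,
        production: production[name] ?? null,
      });
    }
  }
  return mismatches;
}

// Illustrative manifests; the Discussion labels are made up.
const releaseEnv = { jquery: "1.6.4", discussion: "new" };
const productionEnv = { jquery: "1.4.3", discussion: "old" };

// Only the jQuery upgrade is under test, so Discussion is expected to match.
const mismatches = findVersionMismatches(releaseEnv, productionEnv, ["jquery"]);
console.log(mismatches);
```

A non-empty result would signal that the test environment no longer mirrors production, which is the condition that let this bug slip through regression testing.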
Fixes 1. Implement thorough regression testing on all key functionalities, including posting a comment, to catch any unexpected behavior introduced by software updates [9155]. 2. Ensure that the test environment mirrors the final target or production environment as closely as possible to minimize the risk of environment-specific failures not being found in testing [9155]. 3. Review and update code to use the correct method, such as $(element).removeAttr("disabled"), to ensure proper functionality after software updates [9155].
References 1. Interviews with Matt and Gideon from the guardian.co.uk team [9155]

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) The incident described in Article 9155 involved a jQuery upgrade that broke the commenting functionality on the Guardian website. Code that relied on the previous behavior of the .attr() function left the Submit button for comments disabled after the upgrade. The issue was not caught during testing because of discrepancies between the test environment and the production environment, so the failure reached users [9155]. (b) unknown
Phase (Design/Operation) design (a) The software failure incident in Article 9155 occurred due to contributing factors introduced during the system development phase. Specifically, the failure was caused by a bug related to a change in the jQuery framework during a software update. The upgrade to jQuery version 1.6.4 introduced a change in behavior that affected the way the .attr() function worked, leading to unexpected issues with enabling/disabling elements on the website. The bug was known before deployment, but due to timing constraints and the order of deployment of different features, the issue was not addressed in time, resulting in the software failure incident [9155]. (b) The software failure incident in Article 9155 was not directly related to the operation or misuse of the system. Instead, it was primarily a result of a bug introduced during the development phase and the deployment process. The incident highlighted the importance of proper testing environments and the need for thorough regression testing to catch such issues before they impact the live system [9155].
Boundary (Internal/External) within_system, outside_system (a) The software failure incident described in the article was primarily within the system. The issue originated from a change in the jQuery library that affected the behavior of the .attr() function within the system's codebase. This internal change led to unexpected behavior in the commenting platform, causing the Submit button to not enable as intended, ultimately resulting in the failure to post comments [9155]. (b) Additionally, there was a contributing factor outside the system that played a role in the incident. The failure to catch this issue during regression testing was partly due to the discrepancy between the versions of the Discussion platform running on the test environment and the production environment. Upgrading the Discussion platform on the release environment without aligning it with the production environment led to a false sense of security during testing, contributing to the failure not being identified earlier [9155].
Nature (Human/Non-human) non-human_actions, human_actions (a) The software failure incident in this case was primarily due to non-human actions, specifically the upgrade of the jQuery framework from version 1.4.3 to 1.6.4. This upgrade led to a change in the behavior of the .attr() function in jQuery, causing unexpected issues with the Submit button functionality on the commenting platform. The change in behavior of the function was a non-human factor that contributed to the failure incident [9155]. (b) However, human actions also played a role in this software failure incident. The decision to push the new Discussion features after the jQuery upgrade, instead of before, due to timing constraints, was a human action that led to the unexpected earlier deployment of the new functionality. Additionally, the mistake of upgrading the version of Discussion on the Release environment without ensuring it matched the Production environment violated a key testing rule, which was a human error contributing to the incident [9155].
Dimension (Hardware/Software) software (a) The software failure incident in Article 9155 was not due to hardware issues but rather originated in software. The incident was caused by a bug related to a change in the behavior of the jQuery framework after an upgrade. Specifically, the upgrade to jQuery version 1.6.4 affected the way the .attr() function worked, leading to unexpected behavior in the code that disabled/enabled elements on the website. This software bug resulted in the commenting feature being disabled and the "Submit" button becoming unclickable, ultimately impacting user experience on the website [9155].
Objective (Malicious/Non-malicious) non-malicious (a) The software failure incident described in the article is non-malicious. It was caused by a bug introduced during a routine software update where the behavior of a jQuery function changed unexpectedly, leading to the disabling of the commenting feature on the website. The incident was a result of unintended consequences of the software upgrade and the failure to properly test the changes in a staging environment that mirrored the production environment [9155].
Intent (Poor/Accidental Decisions) poor_decisions (a) The software failure incident described in Article 9155 was primarily due to poor_decisions. The incident occurred because of a decision to upgrade the version of jQuery, a Javascript framework, from version 1.4.3 to 1.6.4. This upgrade led to a change in the behavior of the .attr() function in jQuery, causing unexpected issues with the commenting platform. Additionally, there was a poor decision made regarding the timing of deploying the fix for the issue, which resulted in the failure being introduced into the production environment before the fix was implemented [9155].
Capability (Incompetence/Accidental) accidental (a) The software failure incident in this case was not due to development incompetence but rather due to a series of mistakes and oversights during the deployment process. The team was aware of the bug related to the jQuery upgrade but failed to address it properly in their code before deploying the new version. This highlights the importance of thorough testing and ensuring that changes are properly implemented before deployment [9155]. (b) The software failure incident can be categorized as accidental as it was not intentional but rather a result of a series of unfortunate events and oversights during the deployment process. The team did not anticipate the impact of the jQuery upgrade on their code and failed to catch the issue during their regression testing due to differences between the test environment and the production environment [9155].
Duration temporary (a) The software failure incident described in the article was temporary. The issue occurred after a software update was deployed, specifically related to a change in the behavior of the jQuery library's .attr() function. This change led to unexpected behavior in the commenting platform, causing the Submit button to not enable properly, thus preventing users from posting comments. The incident was not a permanent failure but rather a temporary issue resulting from the specific circumstances surrounding the jQuery upgrade and the timing of deploying the fix for the issue [9155].
Behaviour omission, value (a) crash: The software failure incident described in the article does not involve a crash where the system loses state and does not perform any of its intended functions. The issue was related to the incorrect behavior of the software after a jQuery upgrade, specifically affecting the commenting functionality on the website [9155]. (b) omission: The software failure incident can be categorized under omission. The failure occurred because the system omitted to perform its intended functions due to a change in the behavior of the jQuery library after an upgrade. This led to the Submit button for comments not being enabled as expected, resulting in users being unable to post comments [9155]. (c) timing: The software failure incident does not fall under the category of timing issues where the system performs its intended functions but at the wrong time. The problem was not related to the timing of the system's actions but rather to the incorrect behavior caused by the jQuery upgrade [9155]. (d) value: The software failure incident aligns with the value category. The failure occurred because the system was performing its intended functions incorrectly after the jQuery upgrade. Specifically, the behavior of the .attr() function in jQuery changed, leading to unexpected results in the code that disabled and enabled the Submit button for comments [9155]. (e) byzantine: The software failure incident does not exhibit characteristics of a byzantine failure where the system behaves erroneously with inconsistent responses and interactions. The issue described in the article was more straightforward, involving a specific change in behavior due to a jQuery upgrade that led to the failure of the commenting functionality [9155]. (f) other: The software failure incident does not fit into the other category. The failure was clearly explained in terms of the specific change in behavior caused by the jQuery upgrade and how it impacted the functionality of the commenting system on the website [9155].

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence property The consequence of the software failure incident described in the article is as follows: (d) property: People's material goods, money, or data were impacted by the software failure. The incident disabled commenting on the website, which prevented users from posting comments and engaging with the platform as intended [9155].
Domain information (a) The failed system was related to the information industry as it involved the commenting functionality on the guardian.co.uk website, which is a platform for sharing news and information [9155].

Sources
