Incident: Pokémon Go Fest 2017: Network Overload Causes Unplayable Game Experience

Published Date: 2017-07-22

Postmortem Analysis
Timeline
1. The incident occurred on July 22, 2017, during Pokémon Go Fest in Chicago [61095, 61493].
System
1. Mobile phone networks serving the event area [61095]
2. The Pokémon Go game, which suffered crash bugs and log-on authentication problems [61493]
Responsible Organization
1. Mobile network operators, whose overloaded networks at Pokémon Go Fest left players unable to log in [61095].
2. Niantic, the game's developer, whose game suffered bugs that caused crashes and problems authenticating players at log on; spotty cellular connections also contributed [61493].
Impacted Organization
1. Attendees of Pokémon Go Fest in Chicago [61095, 61493]
2. Niantic, the developer [61095, 61493]
Software Causes
1. The concentration of high-bandwidth connections swamped and collapsed the mobile phone networks in the area, leaving players unable to log in [61095].
2. Spotty cellular connections, bugs that caused the game to crash, and problems authenticating players at log on [61493].
Non-software Causes
1. Overloaded mobile phone networks [61095]
2. Spotty cellular connections [61493]
3. Long lines outside Chicago's Grant Park [61493]
4. Problems with authenticating players at log on [61493]
Impacts
1. Overloaded mobile phone networks left attendees struggling to access the game, which was almost unplayable, and long queues formed [61095, 61493].
2. The event had to be paused. As compensation, Niantic offered all registered attendees a full refund of the $20 ticket, $100 worth of in-game Pokécoins, and a Lugia [61095, 61493].
3. Niantic's CEO was booed on stage, and attendees chanted "We can't play" [61095, 61493].
4. The event livestream went into placeholder mode amid spotty cellular connections, game crashes, and player authentication problems [61493].
5. Niantic extended the radius of the area containing rare Pokémon to try to ease congestion on the phone networks [61095].
Preventions
1. Load testing and capacity planning to confirm that the game servers could handle the expected number of players and high-bandwidth connections [61095].
2. Failover mechanisms or backup servers to prevent complete service disruption during a network overload [61095].
3. Real-time communication with attendees about the technical issues and potential solutions, to manage expectations and reduce frustration [61493].
4. Thorough testing and bug fixing before major events to find and address issues that could affect the user experience [61493].
Fixes
1. Improve server capacity and network infrastructure to handle high-bandwidth connections and prevent network overload [61095].
2. Test and optimize the game thoroughly to address the bugs behind the crashes and authentication problems [61493].
3. Take measures to improve the player experience during events, such as extending the radius of rare Pokémon spawns to reduce congestion and compensating affected attendees [61095, 61493].
References
1. Attendees at Pokémon Go Fest in Chicago's Grant Park [61095]
2. Niantic CEO John Hanke [61493]
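
The load-testing and capacity-planning prevention listed above usually starts from a rough estimate of peak demand. The following is a minimal sketch of that arithmetic, not Niantic's method: the function name, the attendee count, the share of attendees playing at once, the request rate per player, and the per-server capacity are all hypothetical placeholders, and none of these figures come from the articles [61095, 61493].

```python
import math


def required_servers(attendees, active_fraction, requests_per_player_per_sec,
                     server_capacity_rps, headroom=0.3):
    """Estimate how many servers are needed to absorb peak event load.

    Every input is a planning assumption supplied by the caller.
    `headroom` is the fraction of each server's capacity held in reserve.
    """
    if not 0 <= headroom < 1:
        raise ValueError("headroom must be in the range [0, 1)")
    # Peak requests per second if the active share of attendees plays at once.
    peak_rps = attendees * active_fraction * requests_per_player_per_sec
    # Capacity each server can offer once the reserve is set aside.
    usable_rps = server_capacity_rps * (1 - headroom)
    return math.ceil(peak_rps / usable_rps)


# Hypothetical inputs, for illustration only.
servers = required_servers(
    attendees=10_000,
    active_fraction=0.8,
    requests_per_player_per_sec=0.5,
    server_capacity_rps=1_000,
    headroom=0.3,
)
print(servers)  # prints 6
```

An estimate like this only sets the target for a load test. It says nothing about the mobile networks between the players and the servers, which is where [61095] locates the overload.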

Software Taxonomy of Faults

Category Option Rationale
Recurring one_organization (a) one_organization: Pokémon Go Fest 2017 was not the first time Niantic, the developer of Pokémon Go, faced technical problems with the game. The game's debut in July 2016 was also marred by overloaded servers and technical problems, which led Niantic Labs to pause the rollout while it addressed them [61493]. (b) multiple_organization: The articles do not report similar incidents at other organizations or with their products and services.
Phase (Design/Operation) design, operation (a) design: The incident can be attributed in part to factors introduced during system development and updates. The overloading of the mobile networks by a concentration of high-bandwidth connections was a key design issue that led to the failure [61095]. Spotty cellular connections, bugs that crashed the game, and problems authenticating players at log on are also described as contributing factors introduced during development [61493]. (b) operation: The incident also involved factors arising from the operation of the system during the event. Attendees complained that they could not log in and that long lines caused delays and made them miss significant parts of the event [61095, 61493]. The log-on authentication problems can also be considered an operational issue, since they affected the user experience during the event [61493].
Boundary (Internal/External) within_system (a) within_system: The failure was primarily within the system. The issues were overloaded mobile phone networks, spotty cellular connections, bugs that crashed the game, and problems authenticating players at log on [61095, 61493]. These factors are attributed to the system itself, including the game's infrastructure and technical setup, and they produced the failure that attendees experienced at the event.
Nature (Human/Non-human) non-human_actions, human_actions (a) non-human_actions: - The incident was primarily due to overloaded mobile phone networks, which made the game almost unplayable for attendees who had queued for hours to catch the Legendary Pokémon Lugia [61095]. - Spotty cellular connections, bugs that crashed the game, and problems authenticating players at log on also contributed [61493]. (b) human_actions: - The high concentration of players' high-bandwidth connections swamped and collapsed the mobile networks in the area, leaving players unable to log in [61095]. - Attendees complained about long lines outside Grant Park, which caused delays and made them miss significant parts of the event [61493]. - Niantic CEO John Hanke was greeted with boos and chants of "We can't play" when he took the stage to address attendees, indicating dissatisfaction with how the situation was handled [61493].
Dimension (Hardware/Software) hardware, software (a) hardware: - The incident was primarily due to overloaded mobile phone networks, which made the game almost unplayable for attendees who had queued for hours [61095]. - Attendees complained that they could not log in because the concentration of high-bandwidth connections swamped and collapsed the mobile networks in the area [61095]. - The spotty cellular connections reported at the event are an infrastructure-related factor [61493]. (b) software: - Bugs that could crash the game and problems authenticating players at log on exacerbated the incident [61493]. - Attendees reported spending more time trying to get the game to load than playing it, which indicates software-related performance issues [61095].
Objective (Malicious/Non-malicious) non-malicious (a) non-malicious: The incident was non-malicious. It was primarily due to overloaded mobile phone networks, which made the game almost unplayable for attendees who had queued for hours to catch the Legendary Pokémon Lugia [61095]. Spotty cellular connections, bugs that crashed the game, and problems authenticating players at log on also contributed [61493]. Niantic acknowledged the issues and offered affected attendees refunds, in-game currency, and other compensation.
Intent (Poor/Accidental Decisions) poor_decisions, accidental_decisions (a) poor_decisions: The incident can be attributed in part to poor decisions by the event organizers and developers. Overloaded mobile phone networks made the game almost unplayable for attendees who had queued for hours to participate [61095]. The concentration of high-bandwidth connections swamped and collapsed the mobile networks in the area, prompting widespread complaints from players who could not log in. Niantic's chief executive, who acknowledged the issue later in the day, was booed on stage by a frustrated audience [61095]. The organizers also extended the radius of the area containing rare Pokémon in an attempt to ease congestion on the phone networks [61095]. (b) accidental_decisions: The incident can also be linked to accidental decisions or unintended consequences. Attendees faced spotty cellular connections, bugs that crashed the game, and problems with player authentication at log on [61493]. These issues significantly disrupted the event, and attendees voiced their frustration on social media and in person. Despite attempts to fix the problems and compensate attendees, the event struggled to deliver a smooth and enjoyable experience [61493].
Capability (Incompetence/Accidental) development_incompetence (a) development_incompetence: The articles describe problems consistent with development incompetence. The incident was primarily caused by overloaded mobile phone networks, which made the game almost unplayable for attendees who had queued for hours to catch the Legendary Pokémon Lugia [61095]. The problems were also attributed to spotty cellular connections, bugs that crashed the game, and issues authenticating players at log on [61493]. Together these point to a lack of professional competence in handling the technical side of the event and the game's infrastructure.
Duration temporary (a) temporary: The incident was temporary. It arose at Pokémon Go Fest in Chicago from overloaded mobile phone networks, spotty cellular connections, bugs that crashed the game, and problems authenticating players at log on. Niantic acknowledged the issues and responded during the event by extending the radius of the area containing rare Pokémon and offering attendees refunds and in-game currency [61095, 61493].
Behaviour crash, omission, value, other (a) crash: The incident can be categorized as a crash. Overloaded mobile phone networks made the game almost unplayable, leaving players unable to log in and the game not functioning as intended [61095]. (b) omission: The incident can also be categorized as an omission. Attendees complained that they spent more time trying to get the game to load than playing it, which indicates that the system failed to perform its intended functions at that time [61095]. (c) timing: The incident does not appear to involve timing. The problems were about the system not functioning properly, not about functions being performed too late or too early [unknown]. (d) value: The incident can be categorized as a value failure. Attendees could not access the mobile app, faced delays from long lines, and hit bugs that caused crashes, which indicates that the system performed its intended functions incorrectly [61493]. (e) byzantine: The incident does not show the characteristics of a byzantine failure. The problems stemmed from technical faults and network overload, not from inconsistent responses or interactions [unknown]. (f) other: The incident can also be categorized as "other". It combined crashes, omissions, and incorrect functioning caused by overloaded mobile networks, bugs, and authentication problems, which produced a chaotic and unplayable experience for attendees [61095, 61493].
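
Several rows above note that Niantic extended the radius of the area containing rare Pokémon to ease congestion on the phone networks [61095]. The following is a minimal sketch of why that can help, assuming players spread roughly evenly over a circular play area: player density then falls with the square of the radius. The function name, player count, and radii are hypothetical, the articles do not give these figures, and real cell coverage is not uniform.

```python
import math


def players_per_km2(players, radius_km):
    """Average player density over a circular play area (uniform spread assumed)."""
    if radius_km <= 0:
        raise ValueError("radius_km must be positive")
    return players / (math.pi * radius_km ** 2)


# Hypothetical numbers, for illustration only.
before = players_per_km2(players=10_000, radius_km=0.5)
after = players_per_km2(players=10_000, radius_km=2.0)

# Quadrupling the radius spreads the same players over 16 times the area.
print(f"Density falls by a factor of {before / after:.0f}")
```

Under these assumptions the load on any one cell site drops sharply, but only if players actually move outward, which the model does not capture.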

IoT System Layer

Layer Option Rationale
Perception None None
Communication None None
Application None None

Other Details

Category Option Rationale
Consequence delay (e) delay: People had to postpone an activity due to the software failure. The articles do not mention death, physical harm, loss of access to food or shelter, impact on material goods, money, or data, or harm to non-human entities as a result of the incident at Pokémon Go Fest. The main consequence observed was a delay in the event: attendees could not play the game as intended and missed significant activities [61095, 61493].
Domain entertainment (a) entertainment: The failed system belongs to the entertainment industry. The incident occurred during Pokémon Go Fest, a public celebration of the game's first anniversary that featured challenges for players, rare Pokémon, and the chance to capture Legendary Pokémon such as Lugia and Articuno [61095, 61493].

